High-dimensional model selection for Gaussian designs

Abstract: We consider the problem of estimating the conditional mean of a real Gaussian variable $Y = \sum_{i=1}^{p} \theta_i X_i + \epsilon$, where the vector of covariates $(X_i)_{1 \le i \le p}$ follows a joint Gaussian distribution. This issue often occurs when one aims at estimating the graph or the distribution of a Gaussian graphical model. We introduce a general model selection procedure based on the minimization of a penalized least-squares type criterion. It handles a variety of problems such as ordered and complete variable selection, allows one to incorporate prior knowledge on the model, and applies when the number of covariates $p$ is larger than the number of observations $n$. Moreover, it is shown to achieve a non-asymptotic oracle inequality independently of the correlation structure of the covariates. We also exhibit various minimax rates of estimation in the considered framework and hence derive adaptivity properties of our procedure.

Keywords: model selection, linear regression, oracle inequalities, Gaussian graphical models, minimax rates of estimation

1 Introduction

1.1 Regression model 

We consider the following regression model

$$Y = X\theta + \epsilon, \qquad (1)$$

where $\theta$ is an unknown vector of $\mathbb{R}^p$. The row vector $X := (X_i)_{1 \le i \le p}$ follows a real zero-mean Gaussian distribution with non-singular covariance matrix $\Sigma$, and $\epsilon$ is a real zero-mean Gaussian random variable independent of $X$ with variance $\sigma^2$. The variance of $\epsilon$ corresponds to the conditional variance of $Y$ given $X$, $\mathrm{Var}(Y|X)$. In the sequel, the parameters $\theta$, $\Sigma$, and $\sigma^2$ are considered as unknown.



Suppose we are given $n$ i.i.d. replications of the vector $(Y, X)$. We respectively write $\mathbf{Y}$ and $\mathbf{X}$ for the vector of $n$ observations of $Y$ and the $n \times p$ matrix of observations of $X$. In the present work, we propose a new procedure to estimate the vector $\theta$ when the matrix $\Sigma$ and the variance $\sigma^2$ are both unknown. This corresponds to estimating the conditional expectation of the variable $Y$ given the random vector $X$. Besides, we want to handle the difficult case of high-dimensional data, i.e. the number of covariates $p$ is possibly much larger than $n$. This estimation problem is equivalent to building a suitable predictor of $Y$ given the covariates $(X_i)_{1 \le i \le p}$. Classically, we shall use the mean-squared prediction error to assess the quality of our estimation. For any $\theta_1, \theta_2 \in \mathbb{R}^p$, it is defined by

$$l(\theta_1, \theta_2) := \mathbb{E}\left[(X\theta_1 - X\theta_2)^2\right]. \qquad (2)$$
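Since $X$ is centered with covariance $\Sigma$, the loss (2) can be evaluated as the quadratic form $(\theta_1 - \theta_2)^T \Sigma (\theta_1 - \theta_2)$. The following minimal sketch illustrates this; the function name `prediction_loss` and the covariance matrix are illustrative assumptions, not taken from the paper.

```python
def prediction_loss(theta1, theta2, sigma):
    """Return l(theta1, theta2) = (theta1 - theta2)^T Sigma (theta1 - theta2).

    This equals E[(X theta1 - X theta2)^2] when X is centered with covariance Sigma.
    """
    diff = [a - b for a, b in zip(theta1, theta2)]
    size = len(diff)
    return sum(diff[i] * sigma[i][j] * diff[j]
               for i in range(size) for j in range(size))


# Hypothetical example: two covariates with unit variance and correlation 0.5.
sigma = [[1.0, 0.5],
         [0.5, 1.0]]
loss = prediction_loss([1.0, 0.0], [0.0, 1.0], sigma)
```

With the identity matrix in place of `sigma`, the loss reduces to the squared Euclidean distance between the two vectors; correlation between the covariates changes it.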



1.2 Applications to Gaussian graphical models (GGM) 

Estimation in the regression model (1) is mainly motivated by the study of Gaussian graphical models (GGM). Let $Z$ be a Gaussian random vector indexed by the elements of a finite set $\Gamma$. The vector $Z$ is a GGM with respect to an undirected graph $\mathcal{G} = (\Gamma, E)$ if for any couple $(i, j)$ which is not contained in the edge set $E$, $Z_i$ and $Z_j$ are independent, given the remaining variables. See Lauritzen [23] for definitions and main properties of GGM. Estimating the neighborhood of a given point $i \in \Gamma$ is equivalent to estimating the support of the regression of $Z_i$ with respect to the covariates $(Z_j)_{j \in \Gamma \setminus \{i\}}$. Meinshausen and Bühlmann have taken this point of view in order to estimate the graph of a GGM. Similarly, we can apply the model selection procedure we shall introduce in this paper to estimate the support of the regression and therefore the graph $\mathcal{G}$ of a GGM.
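The link between neighborhoods and regression supports can be illustrated with a standard identity that is not stated in the text above: if $Z$ is centered with precision matrix $K = \Sigma^{-1}$, the regression of $Z_i$ on the remaining variables has coefficients $-K_{ij}/K_{ii}$, so its support is the set of neighbors of $i$. A small sketch, with a hypothetical precision matrix and hypothetical function names:

```python
def regression_coefficients(precision, i):
    """Coefficients of the regression of Z_i on the other coordinates.

    Uses the standard identity theta_j = -K[i][j] / K[i][i] for j != i.
    """
    return {j: -precision[i][j] / precision[i][i]
            for j in range(len(precision)) if j != i}


def neighborhood(precision, i, tol=1e-12):
    """Neighbors of node i: indices with a non-zero regression coefficient."""
    coeffs = regression_coefficients(precision, i)
    return sorted(j for j, c in coeffs.items() if abs(c) > tol)


# Hypothetical precision matrix of the chain graph 0 - 1 - 2.
K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
neighbors_of_0 = neighborhood(K, 0)
neighbors_of_1 = neighborhood(K, 1)
```

In practice the precision matrix is unknown, which is why the support has to be estimated from data by a variable selection procedure.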

Interest in these models has grown since they allow the description of dependence structure of 
high-dimensional data. As such, they are widely used in spatial statistics [la . l27l | or probabilistic 
expert systems [l3|. More recently, they have been applied to the analysis of microarray data. The 
challenge is to infer the network regulating the expression of the genes using only a small sample 
of data; see for instance Schäfer and Strimmer, or Wille et al.

This has motivated the search for new estimation procedures to handle the linear regression model (1) with Gaussian random design. Finally, let us mention that the model (1) is also of interest when estimating the distribution of directed graphical models or more generally the joint distribution of a large Gaussian random vector. Estimating the joint distribution of a Gaussian vector $(Z_i)_{1 \le i \le p}$ indeed amounts to estimating the conditional expectations and variances of $Z_i$ given $(Z_j)_{1 \le j \le i-1}$ for any $1 \le i \le p$.

1.3 General oracle inequalities

Estimation of high-dimensional Gaussian linear models has now attracted a lot of attention. Various procedures have been proposed to perform the estimation of $\theta$ when $p > n$. The challenge at hand is to design estimators that are both computationally feasible and provably efficient. The Lasso estimator was introduced by Tibshirani [33]. Meinshausen and Bühlmann have shown that this estimator is consistent under a neighborhood stability condition. These convergence results were refined in the works of Zhao and Yu [13], Bunea et al., Bickel et al., or Candès and Plan [12] in a slightly different framework. Candès and Tao have also introduced the Dantzig selector procedure, which performs similarly to $l_1$ penalization methods. In the more specific context of GGM, Bühlmann and Kalisch [21] have analyzed the PC algorithm and have proven its consistency when the GGM follows a faithfulness assumption. All these methods share an attractive computational efficiency and most of them are proven to converge at the optimal rate when the covariates are nearly independent. However, they also share two main drawbacks. First, the $l_1$ estimators are known to behave poorly when the covariates are highly correlated and even for some covariance structures with small correlation (see e.g. [12]). Similarly, the PC algorithm is not consistent if the faithfulness assumption is not fulfilled. Second, these procedures do not allow one to integrate biological or physical prior knowledge. Let us provide two examples. Biologists sometimes have a strong preconception of the underlying biological network thanks to previous experiments. For instance, Sachs et al. [28] have produced multivariate flow cytometry data in order to study a human T cell signaling pathway. Since this pathway has important medical implications, it was already extensively studied and a network is conventionally accepted (see [28]). For this particular example, it could be more interesting to check whether some interactions were forgotten or some unnecessary interactions were added in the model than to perform a complete graph estimation. Moreover, the covariates have in some situations a temporal or spatial interpretation. In such a case, it is natural to introduce an order between the covariates, by assuming that a covariate which is close (in space or time) to the response $Y$ is more likely to be significant. Hence, an ordered variable selection method is here possibly more relevant than the complete variable selection methods previously mentioned.

Let us emphasize the main differences of our estimation setting with related studies in the literature. Birgé and Massart consider model selection in a fixed design setting with known variance. Bunea et al. also suppose that the variance is known. Yet, they consider a random design setting, but they assume that the regression functions are bounded (Assumption A.2 in their paper), which is not the case here. Moreover, they obtain risk bounds with respect to the empirical norm $\|\mathbf{X}(\hat\theta - \theta)\|_n^2$ and not the integrated loss $l(.,.)$. Here, $\|.\|_n$ refers to the canonical norm in $\mathbb{R}^n$ reweighted by $\sqrt{n}$. As mentioned earlier, our objective is to infer the conditional expectation of $Y$ given $X$. Hence, it is more significant to assess the risk with respect to the loss $l(.,.)$. Baraud et al. [3] consider fixed design regression but do not assume that the variance is known.

Our objective is twofold. First, we introduce a general model selection procedure that is very flexible and allows one to integrate any prior knowledge on the regression. We prove non-asymptotic oracle inequalities that hold without any assumption on the correlation structure between the covariates. Second, we obtain non-asymptotic rates of estimation for our model (1) that help us to derive adaptive properties for our criterion.

In the sequel, a model $m$ stands for a subset of $\{1, \ldots, p\}$. We write $d_m$ for the size of $m$, whereas the linear space $S_m$ refers to the set of vectors $\theta \in \mathbb{R}^p$ whose components outside $m$ equal zero. If $d_m$ is smaller than $n$, then we define $\hat\theta_m$ as the least-squares estimator of $\theta$ over $S_m$. In the sequel, $\Pi_m$ stands for the projection of $\mathbb{R}^n$ onto the space generated by $(\mathbf{X}_i)_{i \in m}$. Hence, we have the relation $\mathbf{X}\hat\theta_m = \Pi_m \mathbf{Y}$. Since the covariance matrix $\Sigma$ is non-singular, observe that almost surely the rank of $\Pi_m$ is $d_m$. Given a collection $\mathcal{M}$ of models, our purpose is to select a model $\hat m \in \mathcal{M}$ that exhibits a risk as small as possible with respect to the prediction loss function $l(.,.)$ defined in (2). The model $m^*$ that minimizes the risks $\mathbb{E}[l(\hat\theta_m, \theta)]$ over the whole collection $\mathcal{M}$ is called an oracle. Hence, we want to perform as well as the oracle $\hat\theta_{m^*}$. However, we do not have access to $m^*$ as it requires the knowledge of the true vector $\theta$. A classical method to estimate a good model $\hat m$ is achieved through penalization with respect to the complexity of models. In the sequel, we shall select the model $\hat m$ as

$$\hat m := \arg\min_{m \in \mathcal{M}} \mathrm{Crit}(m) := \arg\min_{m \in \mathcal{M}} \|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2 \, [1 + \mathrm{pen}(m)], \qquad (3)$$

where $\mathrm{pen}(.)$ is a positive function defined on $\mathcal{M}$. Besides, we recall that $\|.\|_n$ refers to the canonical norm in $\mathbb{R}^n$ reweighted by $\sqrt{n}$. Observe that $\mathrm{Crit}(m)$ is the sum of the least-squares error $\|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2$ and a penalty term $\mathrm{pen}(m)$ rescaled by the least-squares error, in order to account for the fact that the conditional variance $\sigma^2$ is unknown. We make precise in Section 2 the heuristics underlying this model selection criterion. Baraud et al. [3] have extensively studied this penalization method in the fixed design Gaussian regression framework with unknown variance. In their introduction, they explain how one may retrieve classical criteria like AIC, BIC [30], and FPE [1] by choosing a suitable penalty function $\mathrm{pen}(.)$.

This model selection procedure is really flexible through the choices of the collection $\mathcal{M}$ and of the penalty function $\mathrm{pen}(.)$. Indeed, we may perform complete variable selection by taking the collection of subsets of $\{1, \ldots, p\}$ whose size is smaller than some integer $d$. Otherwise, by taking a nested collection of models, one performs ordered variable selection. We give more details in Sections 2 and 3. If one has some prior idea on the true model $m$, then one could only consider the collection of models that are close in some sense to $m$. Moreover, one may also give a Bayesian flavor to the penalty function $\mathrm{pen}(.)$ and hence specify some prior knowledge on the model.

First, we state a non-asymptotic oracle inequality when the complexity of the collection $\mathcal{M}$ is small and for penalty functions $\mathrm{pen}(m)$ that are larger than $K d_m/(n - d_m)$ with $K > 1$. Then, we prove that the FPE criterion of Akaike [1], which corresponds to the choice $K = 2$, achieves an asymptotic exact oracle inequality for the special case of ordered variable selection. For the sake of completeness, we prove that choosing $K$ smaller than one yields terrible performances.

In Section 3.3, we consider general collections of models $\mathcal{M}$. By introducing new penalties that take into account the complexity of $\mathcal{M}$ as in [9], we are able to state a non-asymptotic oracle inequality. In particular, we consider the problem of complete variable selection. In Section 3.4, we define penalties based on a prior distribution on $\mathcal{M}$. We then derive the corresponding risk bounds.

Interestingly, these rates of convergence do not depend on the covariance matrix $\Sigma$ of the covariates, whereas known results on the Lasso or the Dantzig selector rely on some assumptions on $\Sigma$, as discussed in Section 5.2. We illustrate in Section 5 on simulated examples that for some covariance matrices $\Sigma$ the Lasso performs poorly whereas our method still behaves well. Besides, our penalization method does not require the knowledge of the conditional variance $\sigma^2$. In contrast, the Lasso and the Dantzig selector are constructed for known variance. Since $\sigma^2$ is unknown, one either has to estimate it or has to use a cross-validation method in order to calibrate the penalty. In both cases, there is some room for improvement in the practical calibration of these estimators.

However, our model selection procedure suffers from a computational cost that depends linearly on the size of the collection $\mathcal{M}$. For instance, the complete variable selection problem is NP-hard. This makes it intractable when $p$ becomes too large (i.e. more than 50). In contrast, our criterion applies for arbitrary $p$ when considering ordered variable selection since the size of $\mathcal{M}$ is linear in $n$. We shall mention in the discussion some possible extensions that we hope can cope with the computational issues.
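To make the computational contrast concrete, the sizes of the two kinds of collections can be counted directly. This is only an illustrative sketch; the function names are hypothetical.

```python
import math


def complete_collection_size(p, d):
    """Number of subsets of {1, ..., p} of size at most d (complete variable selection)."""
    return sum(math.comb(p, k) for k in range(d + 1))


def nested_collection_size(i):
    """Number of models in a nested collection m_0, m_1, ..., m_i (ordered selection)."""
    return i + 1


all_subsets_p50 = complete_collection_size(50, 50)   # every subset of 50 covariates
nested_n50 = nested_collection_size(50 // 2)         # nested models up to dimension n/2, n = 50
```

Since the criterion has to be evaluated once per model, the first count is the number of least-squares fits needed for an exhaustive search over 50 covariates, whereas the second grows only linearly.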

In a simultaneous and independent work, Giraud applies an analogous procedure to estimate the graph of a GGM. Using slightly different techniques, he obtains non-asymptotic results that are complementary to ours. However, he performs an unnecessary thresholding to derive an upper bound of the risk. Moreover, he does not consider the case of nested collections of models as we do in Section 3.1. Finally, he does not derive minimax rates of estimation.



1.4 Minimax rates of estimation 

In order to assess the optimality of our procedure, we investigate in Section 4 the minimax rates of estimation for ordered and complete variable selection. For ordered variable selection, we compute the minimax rate of estimation over ellipsoids, which is analogous to the rate obtained in the fixed design framework. We derive that our penalized estimator is adaptive to the collection of ellipsoids independently of the covariance matrix $\Sigma$. For complete variable selection, we prove that the minimax rate of estimation of vectors $\theta$ with at most $k$ non-zero components is of order $\frac{k \log p}{n}$ when the covariates are independent. This is again coherent with the situation observed in the fixed design setting. Then, the estimator $\tilde\theta$ defined for the complete variable selection problem is shown to be adaptive to any sparse vector $\theta$. Moreover, it seems that the minimax rates may become faster when the matrix $\Sigma$ is far from identity. We investigate this phenomenon in Section 4.2. All these minimax rates of estimation are, to our knowledge, new in the Gaussian random design regression. Tsybakov [35] has derived minimax rates of estimation in a general random design regression setup, but his results do not apply in our setting, as explained in Section 4.2.



1.5 Organization of the paper and some notations 

In Section 2, we make precise our estimation procedure and explain the heuristics underlying the penalization method. The main results are stated in Section 3. In Section 4, we derive the different minimax rates of estimation and assess the adaptivity of the penalized estimator $\hat\theta_{\hat m}$. We perform a simulation study and compare the behaviour of our estimator with the Lasso and the adaptive Lasso in Section 5. Section 6 contains a final discussion and some extensions, whereas the proofs are postponed to Section 7.

Throughout the paper, $\|.\|_n^2$ stands for the square of the canonical norm in $\mathbb{R}^n$ reweighted by $n$. For any vector $Z$ of size $n$, we recall that $\Pi_m Z$ denotes the orthogonal projection of $Z$ onto the space generated by $(\mathbf{X}_i)_{i \in m}$. The notation $X_m$ stands for $(X_i)_{i \in m}$ and $\mathbf{X}_m$ represents the $n \times d_m$ matrix of the $n$ observations of $X_m$. For the sake of simplicity, we write $\tilde\theta$ for the penalized estimator $\hat\theta_{\hat m}$. For any $x > 0$, $\lfloor x \rfloor$ is the largest integer smaller than $x$ and $\lceil x \rceil$ is the smallest integer larger than $x$. Finally, $L, L_1, L_2, \ldots$ denote universal constants that may vary from line to line. The notation $L(.)$ specifies the dependency on some quantities.

2 Estimation procedure 

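Given a collection of models $\mathcal{M}$ and a penalty $\mathrm{pen}: \mathcal{M} \to \mathbb{R}^+$, the estimator $\tilde\theta$ is computed as follows:

Model selection procedure

1. Compute $\hat\theta_m = \arg\min_{\theta' \in S_m} \|\mathbf{Y} - \mathbf{X}\theta'\|_n^2$ for all models $m \in \mathcal{M}$.
2. Compute $\hat m = \arg\min_{m \in \mathcal{M}} \|\mathbf{Y} - \mathbf{X}\hat\theta_m\|_n^2 \, [1 + \mathrm{pen}(m)]$.
3. $\tilde\theta := \hat\theta_{\hat m}$.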
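For ordered variable selection, where the models are nested and the penalty is taken as $K d_m/(n - d_m)$, the three steps above can be sketched in plain Python. The function names and the toy data are hypothetical, and Gram-Schmidt orthogonalization is used here in place of a generic least-squares solver; this is an illustrative choice, not the implementation used in the paper.

```python
def residual_errors(y, columns):
    """Return ||Y - Pi_m Y||_n^2 for the nested models m_0, m_1, ..., m_k.

    `columns` lists the observation vectors of the covariates in their given order.
    Each covariate is orthogonalized against the previous ones (Gram-Schmidt), so the
    residual of model m_i follows from that of m_{i-1} by one extra projection.
    """
    n = len(y)
    residual = list(y)
    basis = []
    errors = [sum(r * r for r in residual) / n]
    for col in columns:
        v = list(col)
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        q = [a / norm for a in v]  # assumes linearly independent columns
        basis.append(q)
        coef = sum(a * b for a, b in zip(residual, q))
        residual = [a - coef * b for a, b in zip(residual, q)]
        errors.append(sum(r * r for r in residual) / n)
    return errors


def select_ordered_model(y, columns, K=2.0):
    """Return the dimension d minimizing ||Y - Pi_m Y||_n^2 [1 + K d / (n - d)]."""
    n = len(y)
    errors = residual_errors(y, columns)
    crit = [err * (1.0 + K * d / (n - d)) for d, err in enumerate(errors)]
    return min(range(len(crit)), key=crit.__getitem__)


# Hypothetical toy data: n = 8, three orthogonal covariates, only the first is active.
columns = [[1, -1, 1, -1, 1, -1, 1, -1],
           [1, 1, -1, -1, 1, 1, -1, -1],
           [1, 1, 1, 1, -1, -1, -1, -1]]
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15, -0.05, -0.05]
y = [2 * x + e for x, e in zip(columns[0], noise)]
selected = select_ordered_model(y, columns)
```

The returned value is the dimension $d_{\hat m}$ of the selected nested model; the estimator $\tilde\theta$ is then the least-squares fit on the first $d_{\hat m}$ covariates.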











The choice of the collection $\mathcal{M}$ and the penalty function $\mathrm{pen}(.)$ depends on the problem under study. In what follows, we provide some preliminary results for the parametric estimators $\hat\theta_m$ and we give a heuristic explanation for our penalization method.

For any vector $\theta'$ in $\mathbb{R}^p$, we define the mean-squared error $\gamma(.)$ and its empirical counterpart $\gamma_n(.)$ as

$$\gamma(\theta') := \mathbb{E}_\theta\left[(Y - X\theta')^2\right] \quad \text{and} \quad \gamma_n(\theta') := \|\mathbf{Y} - \mathbf{X}\theta'\|_n^2. \qquad (4)$$

The function $\gamma(.)$ is closely connected to the loss function $l(.,.)$ through the relation $l(\theta', \theta) = \gamma(\theta') - \gamma(\theta)$.

Given a model $m$ of size strictly smaller than $n$, we refer to $\theta_m$ as the unique minimizer of $\gamma(.)$ over the subset $S_m$. It then follows that $\mathbb{E}_\theta(Y|X_m) = \sum_{i \in m} (\theta_m)_i X_i$ and $\gamma(\theta_m)$ is the conditional variance of $Y$ given $X_m$. In turn, the least-squares estimator $\hat\theta_m$ is the minimizer of $\gamma_n(.)$ over the space $S_m$,

$$\hat\theta_m := \arg\min_{\theta' \in S_m} \gamma_n(\theta') \quad \text{a.s.}$$

It is almost surely uniquely defined since $\Sigma$ is assumed to be non-singular and since $d_m < n$. Besides, $\gamma_n(\hat\theta_m)$ equals $\|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2$. Let us derive two simple properties of $\hat\theta_m$ that will give us some hints to perform model selection.

Lemma 2.1. For any model $m$ whose dimension is smaller than $n - 1$, the expected mean-squared error of $\hat\theta_m$ and the expected least squares of $\hat\theta_m$ respectively equal

$$\mathbb{E}\left[\gamma(\hat\theta_m)\right] = \left[l(\theta_m, \theta) + \sigma^2\right]\left(1 + \frac{d_m}{n - d_m - 1}\right), \qquad (5)$$

$$\mathbb{E}\left[\gamma_n(\hat\theta_m)\right] = \left[l(\theta_m, \theta) + \sigma^2\right]\left(1 - \frac{d_m}{n}\right). \qquad (6)$$

The proof is postponed to the Appendix. From Equation (5), we derive a bias-variance decomposition of the risk of the estimator $\hat\theta_m$:

$$\mathbb{E}\left[l(\hat\theta_m, \theta)\right] = l(\theta_m, \theta) + \left[\sigma^2 + l(\theta_m, \theta)\right]\frac{d_m}{n - d_m - 1}.$$

Hence, $\hat\theta_m$ converges to $\theta_m$ in probability when $n$ converges to infinity. Contrary to the fixed design regression framework, the variance term $[\sigma^2 + l(\theta_m, \theta)]\frac{d_m}{n - d_m - 1}$ depends on the bias term $l(\theta_m, \theta)$. Besides, this variance term does not necessarily increase when the dimension of the model increases.
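The step from Equation (5) to the bias-variance decomposition can be spelled out as follows. It only uses the relation between $\gamma$ and $l$ given after (4), together with $\gamma(\theta) = \mathbb{E}[\epsilon^2] = \sigma^2$, so that $l(\theta', \theta) = \gamma(\theta') - \sigma^2$:

```latex
\begin{align*}
\mathbb{E}\big[l(\hat\theta_m,\theta)\big]
  &= \mathbb{E}\big[\gamma(\hat\theta_m)\big] - \sigma^2 \\
  &= \big[l(\theta_m,\theta)+\sigma^2\big]\Big(1+\frac{d_m}{n-d_m-1}\Big) - \sigma^2 \\
  &= l(\theta_m,\theta) + \big[\sigma^2+l(\theta_m,\theta)\big]\frac{d_m}{n-d_m-1}.
\end{align*}
```

The first term is the bias of the model $m$ and the second one is the variance term discussed above.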



Let us now explain the idea underlying our model selection procedure. We aim at choosing a model $\hat m$ that nearly minimizes the mean-squared error $\gamma(\hat\theta_m)$. Since we do not have access to $\gamma(\hat\theta_m)$ nor to the bias $l(\theta_m, \theta)$, we perform an unbiased estimation of the risk as done by Mallows [24] in the fixed design framework:

$$\gamma(\hat\theta_m) \approx \gamma_n(\hat\theta_m) + \mathbb{E}\left[\gamma(\hat\theta_m) - \gamma_n(\hat\theta_m)\right] \approx \gamma_n(\hat\theta_m)\left[1 + \frac{d_m}{n - d_m}\left(2 + \frac{d_m + 1}{n - d_m - 1}\right)\right]. \qquad (7)$$

By Lemma 2.1, these approximations are in fact equalities in expectation. Since the last expression only depends on the data, we may compute its minimizer over the collection $\mathcal{M}$. This approximation is effective and minimizing it provides a good estimator $\tilde\theta$ when the size of the collection $\mathcal{M}$ is moderate, as stated in Theorem 3.1. We recall that $\|\mathbf{Y} - \Pi_m \mathbf{Y}\|_n^2$ equals $\gamma_n(\hat\theta_m)$. Hence, our previous heuristics would lead to a choice of penalty $\mathrm{pen}(m) = \frac{d_m}{n - d_m}\left(2 + \frac{d_m + 1}{n - d_m - 1}\right)$ in our criterion (3), whereas the FPE criterion corresponds to $\mathrm{pen}(m) = \frac{2 d_m}{n - d_m}$. These two penalties are equivalent when the dimension $d_m$ is small compared to $n$. In Theorem 3.1, we explain why these criteria allow us to derive approximate oracle inequalities when there is a small number of models. However, when the size of the collection $\mathcal{M}$ increases, we need to design other penalties that take into account the complexity of the collection $\mathcal{M}$ (see Section 3.2).
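The identity behind (7) and the closeness of the heuristic penalty to the FPE penalty can be checked with exact rational arithmetic. The sketch below takes the expressions of Lemma 2.1 as given; the function names and the value of n are hypothetical.

```python
from fractions import Fraction


def heuristic_penalty(d, n):
    """Penalty suggested by (7): d/(n-d) * (2 + (d+1)/(n-d-1))."""
    return Fraction(d, n - d) * (2 + Fraction(d + 1, n - d - 1))


def fpe_penalty(d, n):
    """FPE penalty 2d/(n-d)."""
    return Fraction(2 * d, n - d)


def expected_ratio(d, n):
    """E[gamma(theta_hat_m)] / E[gamma_n(theta_hat_m)] computed from (5) and (6)."""
    return (1 + Fraction(d, n - d - 1)) / (1 - Fraction(d, n))


n = 100
# 1 + heuristic penalty equals the ratio of the two expectations for every dimension.
exact_match = all(1 + heuristic_penalty(d, n) == expected_ratio(d, n)
                  for d in range(0, n - 1))
# Relative gap between the two penalties for a small and a large model.
gap_small = float(heuristic_penalty(2, n) / fpe_penalty(2, n) - 1)
gap_large = float(heuristic_penalty(50, n) / fpe_penalty(50, n) - 1)
```

The relative gap equals $(d_m + 1)/(2(n - d_m - 1))$, which is negligible for small models and becomes substantial when $d_m$ is of the order of $n$.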



3 Oracle inequalities 

3.1 A small number of models 

In this section, we restrict ourselves to the situation where the collection of models $\mathcal{M}$ only contains
a small number of models as defined in 0| Sect 3.1.2. 

($\mathbb{H}_{Pol}$): for each $d \ge 1$ the number of models $m \in \mathcal{M}$ such that $d_m = d$ grows at most polynomially with respect to $d$. In other words, there exist $\alpha$ and $\beta$ such that for any $d \ge 1$, $\mathrm{Card}(\{m \in \mathcal{M}, d_m = d\}) \le \alpha d^{\beta}$.

($\mathbb{H}_{\eta}$): The dimension $d_m$ of every model $m$ in $\mathcal{M}$ is smaller than $\eta n$. Moreover, the number of observations $n$ is larger than $6/(1 - \eta)$.

Assumption ($\mathbb{H}_{Pol}$) states that there is at most a polynomial number of models with a given dimension. It includes in particular the problem of ordered variable selection, on which we will focus in this section. Let us introduce the collection of models relevant for this issue. For any positive number $i$ smaller than or equal to $p$, we define the model $m_i := \{1, \ldots, i\}$ and the nested collection $\mathcal{M}_i := \{m_0, m_1, \ldots, m_i\}$. Here, $m_0$ refers to the empty model. Any collection $\mathcal{M}_i$ satisfies ($\mathbb{H}_{Pol}$) with $\beta = 0$ and $\alpha = 1$.

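Theorem 3.1. Let $\eta$ be any positive number smaller than one. Assume that the collection $\mathcal{M}$ satisfies ($\mathbb{H}_{Pol}$) and ($\mathbb{H}_{\eta}$). If the penalty $\mathrm{pen}(.)$ is lower bounded as follows

$$\mathrm{pen}(m) \ge K \frac{d_m}{n - d_m} \quad \text{for all } m \in \mathcal{M} \text{ and some } K > 1, \qquad (8)$$

then

$$\mathbb{E}\left[l(\tilde\theta, \theta)\right] \le L(K, \eta) \inf_{m \in \mathcal{M}} \left[l(\theta_m, \theta) + \frac{n - d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2 + l(\theta_m, \theta)\right]\right] + \tau_n, \qquad (9)$$

where the error term $\tau_n$ is defined as

$$\tau_n = \tau_n[\mathrm{Var}(Y), K, \eta, \alpha, \beta] := L_1(K, \eta, \alpha, \beta)\left[\frac{\sigma^2}{n} + n^{3+\beta}\,\mathrm{Var}(Y) \exp[-n L_2(K, \eta)]\right]$$

and $L_2(K, \eta)$ is positive.

The theorem applies for any $n$, any $p$ and there is no hidden dependency on $n$ or $p$ in the constants. Besides, observe that the theorem does not depend at all on the covariance matrix $\Sigma$ between the covariates. If we choose the penalty $\mathrm{pen}(m) = K \frac{d_m}{n - d_m}$, we obtain an approximate oracle inequality

$$\mathbb{E}\left[l(\tilde\theta, \theta)\right] \le L(K, \eta) \inf_{m \in \mathcal{M}} \mathbb{E}\left[l(\hat\theta_m, \theta)\right] + \tau_n[\mathrm{Var}(Y), K, \eta, \alpha, \beta],$$

thanks to Lemma 2.1. The term $n^{3+\beta}\,\mathrm{Var}(Y) \exp[-n L_2(K, \eta)]$ converges exponentially fast to 0 when $n$ goes to infinity and is therefore considered as negligible. One interesting feature of this oracle inequality is that it allows one to consider models of dimensions as close to $n$ as we want, provided that $n$ is large enough. This will not be possible in the next section when handling more complex collections of models.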



We have stated that $\tilde\theta$ performs almost as well as the oracle model; one may wonder whether it is possible to perform exactly as well as the oracle. In the next proposition, we shall prove that under an additional assumption the estimator $\tilde\theta$ with $K = 2$ follows an asymptotic exact oracle inequality. We state the result for the problem of ordered variable selection. Let us assume for a moment that the set of covariates is infinite, i.e. $p = +\infty$. In this setting, we define the subset $\Theta$ of sequences $\theta = (\theta_i)_{i \ge 1}$ such that $\langle X, \theta \rangle$ converges in $L^2$. In the following proposition, we assume that $\theta \in \Theta$.

Definition 3.1. Let $s$ and $R$ be two positive numbers. We define the so-called ellipsoid $\mathcal{E}'_s(R)$ as

In Section 4.1, we explain why we call this set $\mathcal{E}'_s(R)$ an ellipsoid.

Proposition 3.2. Assume there exist $s$, $s'$, and $R$ such that $\theta \in \mathcal{E}'_s(R)$ and such that for any positive number $R'$, $\theta \notin \mathcal{E}'_{s'}(R')$. We consider the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$ and the penalty $\mathrm{pen}(m) = 2\frac{d_m}{n - d_m}$. Then, there exists a constant $L(s, R)$ and a sequence $\tau_n$ converging to zero at infinity such
that, with probability, at least 1 — L(s, ii)-^r , 

$$l(\tilde\theta, \theta) \le [1 + \tau_n] \inf_{m \in \mathcal{M}_{\lfloor n/2 \rfloor}} l(\hat\theta_m, \theta). \qquad (10)$$

Admittedly, we let $n$ go to infinity in this proposition, but we are still in a high-dimensional setting since $p = +\infty$ and since the size of the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$ goes to infinity with $n$. Let us briefly discuss the assumption on $\theta$. Roughly speaking, it ensures that the oracle model has a dimension not too close to zero (larger than $\log^2(n)$) and small compared to $n$ (smaller than $n/\log n$). Notice that it is classical to assume that the bias is non-zero for every model $m$ for proving the asymptotic optimality of Mallows' $C_p$ (cf. Shibata [31] and Birgé and Massart). Here, we make a stronger assumption because the bound (10) holds in probability and because the design is Gaussian. Moreover, our stronger assumption has already been made by Stone [12] and Arlot. We refer to Arlot, Sect. 4.1, for a more complete discussion of this assumption.

The choice of the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$ is arbitrary and one can extend it to many collections that satisfy ($\mathbb{H}_{Pol}$) and ($\mathbb{H}_{\eta}$). As mentioned in Section 2, the penalty $\mathrm{pen}(m) = 2\frac{d_m}{n - d_m}$ corresponds to the FPE model selection procedure. In conclusion, the choice of the FPE criterion turns out to be asymptotically optimal when the complexity of $\mathcal{M}$ is small.

We now underline that the condition $K > 1$ in Theorem 3.1 is almost necessary. Indeed, choosing $K$ smaller than one yields terrible statistical performances.

Proposition 3.3. Suppose that $p$ is larger than $n/2$. Let us consider the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$ and assume that for some $\nu > 0$,

$$\mathrm{pen}(m) = (1 - \nu)\frac{d_m}{n - d_m}, \qquad (11)$$

for any model $m \in \mathcal{M}_{\lfloor n/2 \rfloor}$. Then, given $\delta \in (0, 1)$, there exists some $n_0(\nu, \delta)$ only depending on $\nu$ and $\delta$ such that for $n \ge n_0(\nu, \delta)$,

$$\mathbb{P}\left[d_{\hat m} \ge \frac{n}{4}\right] \ge 1 - \delta \quad \text{and} \quad \mathbb{E}\left[l(\tilde\theta, \theta)\right] \ge l(\theta_{m_{\lfloor n/2 \rfloor}}, \theta) + L(\delta, \nu)\sigma^2.$$

If one chooses too small a penalty, then the dimension $d_{\hat m}$ of the selected model is huge and the penalized estimator $\tilde\theta$ performs poorly. The hypothesis $p \ge n/2$ is needed for defining the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$. Once again, the choice of the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$ is rather arbitrary and the result of Proposition 3.3 still holds for collections $\mathcal{M}$ which satisfy ($\mathbb{H}_{Pol}$) and ($\mathbb{H}_{\eta}$) and contain at least one model of large dimension. Theorem 3.1 and Proposition 3.3 tell us that $\frac{d_m}{n - d_m}$ is the minimal penalty.

In practice, we advise choosing $K$ between 2 and 3. Admittedly, $K = 2$ is asymptotically optimal by Proposition 3.2. Nevertheless, we have observed in simulations that $K = 3$ gives slightly better results when $n$ is small. For ordered variable selection, we suggest taking the collection $\mathcal{M}_{\lfloor n/2 \rfloor}$.

3.2 A general model selection theorem

In this section, we study the performance of the penalized estimator $\tilde\theta$ for general collections $\mathcal{M}$. Classically, we need to penalize the models $m$ more strongly, incorporating the complexity of the collection. As a special case, we shall consider the problem of complete variable selection. This is why we define the collections $\mathcal{M}_p^d$ that consist of all subsets of $\{1, \ldots, p\}$ of size less than or equal to $d$.

Definition 3.2. Given a collection $\mathcal{M}$, we define the function $H(.)$ by

$$H(d) := \frac{1}{d} \log\left[\mathrm{Card}(\{m \in \mathcal{M}, d_m = d\})\right]$$

for any integer $d \ge 1$.

This function measures the complexity of the collection $\mathcal{M}$. For the collection $\mathcal{M}_p^d$, $H(k)$ is upper bounded by $\log(ep/k)$ for any $k \le d$ (see Eq. (4.10) in [25]). Contrary to the situation encountered in ordered variable selection, we are not able to consider models of arbitrary dimensions and we shall make the following assumption.

($\mathbb{H}_{K,\eta}$): Given $K > 1$ and $\eta > 0$, the collection $\mathcal{M}$ and the number $\eta$ satisfy

$$\forall m \in \mathcal{M}, \quad \left[1 + \sqrt{2H(d_m)}\right]^2 \frac{d_m}{n - d_m} \le \eta < \eta(K), \qquad (12)$$

where $\eta(K)$ is defined as $\eta(K) := [1 - 2(3/(K+2))^{1/6}]^2 \vee [1 - (3/(K+2))^{1/6}]^2/4$.

The function $\eta(K)$ is positive and increases when $K$ is larger than one. Besides, $\eta(K)$ converges to one when $K$ converges to infinity. We do not claim that the expression of $\eta(K)$ is optimal. We are more interested in its behavior when $K$ is large.

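Theorem 3.4. Let $K > 1$ and let $\eta < \eta(K)$. Assume that $n$ is larger than some quantity $n_0(K)$ only depending on $K$ and that the collection $\mathcal{M}$ satisfies ($\mathbb{H}_{K,\eta}$). If the penalty $\mathrm{pen}(.)$ is lower bounded as follows

$$\mathrm{pen}(m) \ge K \frac{d_m}{n - d_m}\left(1 + \sqrt{2H(d_m)}\right)^2 \quad \text{for any } m \in \mathcal{M}, \qquad (13)$$

then

$$\mathbb{E}\left[l(\tilde\theta, \theta)\right] \le L(K, \eta) \inf_{m \in \mathcal{M}} \left\{l(\theta_m, \theta) + \frac{n - d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2 + l(\theta_m, \theta)\right]\right\} + \tau_n, \qquad (14)$$

where $\tau_n$ is defined as

$$\tau_n[\mathrm{Var}(Y), K, \eta] := \sigma^2 \frac{L_1(K, \eta)}{n} + L_2(K, \eta)\, n^{5/2}\, \mathrm{Var}(Y) \exp[-n L_3(K, \eta)]$$

and $L_3(K, \eta)$ is positive.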
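For complete variable selection, the complexity function of Definition 3.2 and the quantities entering (12) and (13) are easy to evaluate numerically. The sketch below assumes the collection of all subsets of size at most d; the function names and the values of p, n and eta are hypothetical choices made for illustration, and eta = 0.75 is arbitrary (the theorem additionally requires eta to be smaller than eta(K)).

```python
import math


def complexity(p, d):
    """H(d) = (1/d) log Card{m : d_m = d} when the models are all subsets of {1, ..., p}."""
    return math.log(math.comb(p, d)) / d


def assumption_lhs(p, n, d):
    """Left-hand side of (12): (1 + sqrt(2 H(d)))^2 d / (n - d)."""
    return (1.0 + math.sqrt(2.0 * complexity(p, d))) ** 2 * d / (n - d)


def minimal_penalty(p, n, d, K=2.0):
    """Lower bound (13) on pen(m) for a model of dimension d."""
    return K * assumption_lhs(p, n, d)


def max_dimension(p, n, eta):
    """Largest d such that every dimension 1, ..., d satisfies (12) with this eta."""
    d = 0
    while d + 1 < min(n, p + 1) and assumption_lhs(p, n, d + 1) <= eta:
        d += 1
    return d


p, n = 200, 50  # hypothetical problem size with p > n
# H(k) <= log(e p / k), the bound quoted from Eq. (4.10) in [25].
bound_holds = all(complexity(p, k) <= math.log(math.e * p / k) for k in range(1, 11))
d_max = max_dimension(p, n, eta=0.75)
```

On this example the admissible model dimension is very small compared to n, which illustrates why models of arbitrary dimension cannot be handled for such rich collections.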
This theorem provides an oracle type inequality of the same type as the one obtained in the Gaussian sequence framework by Birgé and Massart. The risk of the penalized estimator $\tilde\theta$ almost achieves the infimum of the risks plus a penalty term depending on the function $H(.)$. As in Theorem 3.1, the error term $\tau_n[\mathrm{Var}(Y), K, \eta]$ depends on $\theta$, but this part goes exponentially fast to 0 with $n$.

Comments:

• As for Theorem 3.1, the result holds for arbitrarily large $p$ as long as $n$ is larger than the quantity $n_0(K)$ (independent of $p$). There is no hidden dependency on $p$ except in the complexity function $H(.)$ and Assumption ($\mathbb{H}_{K,\eta}$), which we shall discuss for the particular case of complete variable selection. Moreover, one may easily check Assumption ($\mathbb{H}_{K,\eta}$) since it only depends on the collection $\mathcal{M}$ and not on some unknown quantity.

• This result (as well as Theorem 3.1) does not depend at all on the covariance matrix $\Sigma$ between the covariates.

• The penalty introduced in this theorem only depends on the collection $\mathcal{M}$ and a number $K > 1$. Hence, performing the procedure does not require any knowledge of $\sigma^2$, $\Sigma$, or $\theta$. We give hints at the end of the section for choosing the constant $K$.

• Observe that Theorem 3.1 is not just a corollary of Theorem 3.4. If we apply Theorem 3.4 to the problem of ordered selection, then the maximal size of the model has to be smaller than $n\frac{\eta(K)}{1 + \eta(K)}$, which depends on $K$ and is always smaller than $n/2$. In contrast, Theorem 3.1 handles models of size up to $n - 7$.



3.3 Application to complete variable selection 



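Let us now restate Theorem 3.4 for the particular issue of complete variable selection. Consider $K > 1$, $\eta < \eta(K)$ and $d \ge 1$ such that $\mathcal{M}_p^d$ satisfies Assumption ($\mathbb{H}_{K,\eta}$). If we take for any model $m \in \mathcal{M}_p^d$ the penalty term

$$\mathrm{pen}(m) = K \frac{d_m}{n - d_m}\left[1 + \sqrt{2\log\left(\frac{ep}{d_m}\right)}\right]^2,$$

then we get

$$\mathbb{E}\left[l(\tilde\theta, \theta)\right] \le L(K, \eta) \inf_{m \in \mathcal{M}_p^d} \left\{l(\theta_m, \theta) + \frac{d_m}{n}\log\left(\frac{ep}{d_m}\right)\left[\sigma^2 + l(\theta_m, \theta)\right]\right\} + \tau_n[\mathrm{Var}(Y), K, \eta]. \qquad (15)$$

We shall prove in Section 4 that the term $\log(p/d_m)$ is unavoidable and that the obtained estimator is optimal from a minimax point of view. If the true parameter $\theta$ belongs to some unknown model $m$, then the rate of estimation of $\tilde\theta$ is of the order $\frac{d_m}{n}\log(p/d_m)$. Let us compare our result with other procedures.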



• The oracle type inequalities look similar to the ones obtained by Birgé and Massart [8], Bunea
et al. [IB] and Baraud et al. [1]. However, Birgé and Massart and Bunea et al. assume that
the variance $\sigma^2$ is known. Moreover, Birgé and Massart and Baraud et al. only consider a
fixed design setting. Bunea et al. do allow the design to be random, but they assume that
the regression functions are bounded (Assumption A.2 in their paper), which is not the case
here. Moreover, they only get risk bounds with respect to the empirical norm $\|.\|_n$ and not
the integrated loss $l(.,.)$.

• As mentioned previously, our oracle inequality holds for any covariance matrix $\Sigma$. In contrast,
Lasso and Dantzig selector estimators have been shown to satisfy oracle inequalities under
assumptions on the empirical design $\mathbf{X}$. In [13], Candès and Tao indeed assume that the
singular values of $\mathbf{X}$ restricted to any subset of size proportional to the sparsity of $\theta$ are
bounded away from zero. Bickel et al. [H] introduce an extension of this condition
for both the Lasso and the Dantzig selector. In a recent work [12], Candès and Plan state
that if the empirical correlation between the covariates is smaller than $L(\log p)^{-1}$, then the
Lasso follows an oracle inequality in a majority of cases. Their condition is in fact almost
necessary. On the one hand, they give examples of some low correlated situations where the
Lasso performs poorly. On the other hand, they prove that the Lasso fails to work well if the
correlation between the covariates is larger than $L(\log p)^{-1}$. Candès and Plan consider
the loss function $\|\mathbf{X}\widehat\theta - \mathbf{X}\theta\|_n$, whereas we use the integrated loss $l(\widehat\theta,\theta)$, but this does not really
change the impact of their result. We refer to their paper for further details. The main point is
that for some correlation structures, our procedure still works well, whereas the Lasso and the
Dantzig selector procedures perform poorly. In many problems such as GGM estimation, the
correlation between the covariates may be high and even the relaxed assumptions of Candès
and Plan may not be fulfilled. In Section 5, we illustrate this phenomenon by comparing
our procedure with the Lasso on numerical examples for independent and highly correlated
covariates.

• Suppose that the covariates are independent and that $\theta$ belongs to some model $m$. The rate
of convergence of the Lasso is then of the order $\frac{d_m}{n}\log(p)\,\sigma^2$, whereas ours is $\frac{d_m}{n}\log(p/d_m)\,\sigma^2$.
Consider the case where $p$ and $d_m$ are of the same order whereas $n$ is large. Our model
selection procedure then outperforms the Lasso by a $\log(p)$ factor even if the covariates
are independent.



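Let us restate Assumption $(\mathbb{H}_{K,\eta})$ for the particular collection $\mathcal{M}_p^d$. Given some $K > 1$ and
some $\eta < \eta(K)$, the collection $\mathcal{M}_p^d$ satisfies $(\mathbb{H}_{K,\eta})$ if

\[ d \leq \eta\,\frac{n-d}{\left[1+\sqrt{2\left(1+\log(p/d)\right)}\right]^2} . \tag{16} \]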



If $p$ is much larger than $n$, the dimension $d$ of the largest model has to be smaller than the
order $\eta\, n/[2\log(p)]$. Candès and Plan state a similar condition for the Lasso. We believe that this
condition is unimprovable. Indeed, Wainwright states in Th. 2 of [38] a result going in this
direction: it is impossible to estimate reliably the support of a $k$-sparse vector $\theta$ if $n$ is smaller
than the order $k\log(p/k)$. If $\log(p)$ is larger than $n$, then we cannot apply Theorem 3.4. This
ultra-high dimensional setting is also not handled by the theory for the Lasso and the Dantzig
selector. Finally, if $p$ is of the same order as $n$, then Condition (16) is satisfied for dimensions
$d$ of the same order as $n$. Hence, our method works well even when the sparsity is of the same
order as $n$, which is not the case for the Lasso or the Dantzig selector.
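
To get a feel for this condition, the following minimal sketch computes the largest admissible dimension $d$ for given $n$ and $p$. It assumes that Condition (16) reads $d\,[1+\sqrt{2(1+\log(p/d))}]^2 \leq \eta\,(n-d)$. The function names are illustrative, and the value of `eta` is an arbitrary placeholder since $\eta(K)$ is not computed here.

```python
import math

def satisfies_condition_16(d, n, p, eta):
    """Check d * [1 + sqrt(2 (1 + log(p/d)))]^2 <= eta * (n - d) (assumed form of (16))."""
    if d < 1 or d >= n or d > p:
        return False
    complexity = (1.0 + math.sqrt(2.0 * (1.0 + math.log(p / d)))) ** 2
    return d * complexity <= eta * (n - d)

def largest_dimension(n, p, eta):
    """Largest d in {1, ..., min(n - 1, p)} satisfying the condition (0 if none does)."""
    best = 0
    for d in range(1, min(n - 1, p) + 1):
        if satisfies_condition_16(d, n, p, eta):
            best = d
    return best

# eta = 0.5 is an arbitrary illustrative value, not eta(K) from the theorem.
d_high_dim = largest_dimension(n=100, p=1000, eta=0.5)  # p much larger than n
d_moderate = largest_dimension(n=100, p=100, eta=0.5)   # p of the same order as n
```

With these placeholder values, the admissible dimension is small when $p$ is much larger than $n$ and does not decrease when $p$ is reduced to the order of $n$, in line with the discussion above.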
Let us discuss the practical choice of d and K for complete variable selection. From numerical
studies, we advise to take d < , n , p — -yr Ap even if this quantity is slightly larger than what 

is ensured by the theory. The practical choice of $K$ depends on the aim of the study. If one aims
at minimizing the risk, $K = 1.1$ gives rather good results. A larger $K$ such as 1.5 or 2 yields
a more conservative procedure and consequently a lower FDR. We compare these values of $K$ on
simulated examples in Section 5.
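
The procedure for complete variable selection can be sketched as an exhaustive search. This is a minimal illustration, not the author's implementation (the paper's computations were made with R). It assumes that the selected model minimizes $\gamma_n(\widehat\theta_m)[1+\mathrm{pen}(m)]$, where $\gamma_n(\widehat\theta_m)$ is the residual sum of squares divided by $n$, and that the penalty has the form $K\frac{d_m}{n-d_m}[1+\sqrt{2(1+\log(p/d_m))}]^2$ with a zero penalty for the empty model. All function names and the toy data are hypothetical.

```python
import itertools
import math
import random

def residual_sum_of_squares(X, y, cols):
    """RSS of the least-squares fit of y on the columns `cols` of X."""
    resid = list(y)
    basis = []  # orthonormal basis of the span of the selected columns
    for j in cols:
        v = [row[j] for row in X]
        for q in basis:  # modified Gram-Schmidt step
            c = sum(a * b for a, b in zip(q, v))
            v = [a - c * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-12:  # column already in the span: nothing to add
            continue
        q = [a / norm for a in v]
        basis.append(q)
        c = sum(a * b for a, b in zip(q, resid))
        resid = [a - c * b for a, b in zip(resid, q)]
    return sum(a * a for a in resid)

def penalty(d_m, n, p, K):
    """Assumed penalty for complete variable selection (0 for the empty model)."""
    if d_m == 0:
        return 0.0
    return K * d_m / (n - d_m) * (1.0 + math.sqrt(2.0 * (1.0 + math.log(p / d_m)))) ** 2

def select_model(X, y, d, K=1.1):
    """Minimize gamma_n(theta_hat_m) * (1 + pen(m)) over all subsets of size <= d."""
    n, p = len(X), len(X[0])
    best, best_crit = (), None
    for size in range(d + 1):
        for m in itertools.combinations(range(p), size):
            crit = residual_sum_of_squares(X, y, m) / n * (1.0 + penalty(size, n, p, K))
            if best_crit is None or crit < best_crit:
                best, best_crit = m, crit
    return best

# Toy data (hypothetical): two strong signals among p = 6 independent covariates.
rng = random.Random(0)
n, p = 60, 6
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + 0.5 * rng.gauss(0.0, 1.0) for row in X]
selected = select_model(X, y, d=3, K=2.0)
```

The search visits every subset of size at most $d$, so its cost grows combinatorially with $p$. The sketch is only meant for small examples.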

3.4 Penalties based on a prior distribution 



The penalty defined in Theorem 3.4 only depends on the models through their cardinality. However,
the methodology developed in the proof easily extends to the case where the user has some
prior knowledge of the relevant models. Let $\pi_{\mathcal{M}}$ be a prior probability measure on the collection
$\mathcal{M}$. For any non-empty model $m \in \mathcal{M}$, we define $l_m$ by

\[ l_m := -\frac{\log\left(\pi_{\mathcal{M}}(m)\right)}{d_m} . \]

By convention, we set $l_\emptyset$ to 1. We define in the next proposition penalty functions based on the
quantity $l_m$ that yield non-asymptotic oracle inequalities.

Assumption $(\mathbb{H}^l_{K,\eta})$: Given $K > 1$ and $\eta > 0$, the collection $\mathcal{M}$, the numbers $l_m$ and the number
$\eta$ satisfy

\[ \forall m \in \mathcal{M}, \quad \frac{\left[1+\sqrt{2 l_m}\right]^2 d_m}{n-d_m} \leq \eta < \eta(K) , \tag{17} \]

where $\eta(K)$ is defined as in $(\mathbb{H}_{K,\eta})$.

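Proposition 3.5. Let $K > 1$ and let $\eta < \eta(K)$. Assume that $n \geq n_0(K)$ and that Assumption
$(\mathbb{H}^l_{K,\eta})$ is fulfilled. If the penalty pen(.) is lower bounded as follows

\[ \mathrm{pen}(m) \geq K\,\frac{d_m}{n-d_m}\left(1+\sqrt{2 l_m}\right)^2 \quad \text{for any } m \in \mathcal{M}\setminus\{\emptyset\} , \tag{18} \]

then

\[ \mathbb{E}\left[l(\widetilde\theta,\theta)\right] \leq L(K,\eta)\inf_{m\in\mathcal{M}}\left\{ l(\theta_m,\theta) + \frac{n-d_m}{n}\,\mathrm{pen}(m)\left[\sigma^2 + l(\theta_m,\theta)\right]\right\} + \tau_n , \tag{19} \]

where $L(K,\eta)$ and $\tau_n$ are the same as in Theorem 3.4.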
Comments:

• In this proposition, the penalty (18) as well as the risk bound (19) depend on the prior
distribution $\pi_{\mathcal{M}}$. In fact, the bound (19) means that $\widetilde\theta$ achieves the trade-off between the bias
and some prior weight, which is of the order

\[ -\log\left[\pi_{\mathcal{M}}(m)\right]\left[\sigma^2 + l(\theta_m,\theta)\right]/n . \]

This emphasizes that $\widetilde\theta$ favours models with a high prior probability. Similar risk bounds are
obtained in the fixed design regression framework in Birgé and Massart [8].

• Although the proofs of Proposition 3.5 and Theorem 3.4 are very similar, Proposition 3.5 does not
imply the theorem.

• Roughly speaking, Assumption $(\mathbb{H}^l_{K,\eta})$ requires that the prior probability $\pi_{\mathcal{M}}(m)$ is not
exponentially small with respect to $n$.
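
As an illustration of how a prior translates into a penalty, the sketch below computes $l_m = -\log(\pi_{\mathcal{M}}(m))/d_m$ and the lower bound $K\frac{d_m}{n-d_m}(1+\sqrt{2l_m})^2$ on the penalty, assuming these are the forms of the definition of $l_m$ and of (18). The function names and the numerical values are hypothetical.

```python
import math

def prior_complexity(prior_prob, d_m):
    """l_m = -log(pi_M(m)) / d_m for a non-empty model; 1 by convention for the empty model."""
    if d_m == 0:
        return 1.0
    return -math.log(prior_prob) / d_m

def penalty_lower_bound(prior_prob, d_m, n, K):
    """Assumed right-hand side of (18): K * d_m / (n - d_m) * (1 + sqrt(2 l_m))^2."""
    l_m = prior_complexity(prior_prob, d_m)
    return K * d_m / (n - d_m) * (1.0 + math.sqrt(2.0 * l_m)) ** 2

def satisfies_prior_assumption(prior_prob, d_m, n, eta):
    """Check (1 + sqrt(2 l_m))^2 * d_m / (n - d_m) <= eta for one model."""
    l_m = prior_complexity(prior_prob, d_m)
    return (1.0 + math.sqrt(2.0 * l_m)) ** 2 * d_m / (n - d_m) <= eta

# A model of dimension 3 with prior probability exp(-3) has l_m = 1.
pen_likely = penalty_lower_bound(math.exp(-3.0), d_m=3, n=103, K=2.0)
# The same dimension with a much smaller prior probability is penalized more.
pen_unlikely = penalty_lower_bound(1e-6, d_m=3, n=103, K=2.0)
```

A model with a small prior probability receives a larger penalty, which is how the estimator comes to favour models with a high prior probability.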



4 Minimax lower bounds and Adaptivity 

Throughout this section, we emphasize the dependency of the expectations $\mathbb{E}(.)$ and the probabilities
$\mathbb{P}(.)$ on $\theta$ by writing $\mathbb{E}_\theta$ and $\mathbb{P}_\theta$. We have stated in Section 3 that the penalized estimator $\widetilde\theta$ performs
almost as well as the best of the estimators $\widehat\theta_m$. We now want to compare the risk of $\widetilde\theta$ with the
risk of any other possible estimator $\widehat\theta$. There is no hope to make a pointwise comparison
with an arbitrary estimator. Therefore, we classically consider the maximal risk over some suitable
subsets $\Theta$ of $\mathbb{R}^p$. The minimax risk over the set $\Theta$ is given by $\inf_{\widehat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta[l(\widehat\theta,\theta)]$, where the
infimum is taken over all possible estimators $\widehat\theta$ of $\theta$. Then, the estimator $\widetilde\theta$ is said to be approximately
minimax with respect to the set $\Theta$ if the ratio

\[ \frac{\sup_{\theta\in\Theta}\mathbb{E}_\theta\left[l(\widetilde\theta,\theta)\right]}{\inf_{\widehat\theta}\sup_{\theta\in\Theta}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right]} \]

is smaller than a constant that does not depend on $\sigma^2$, $n$, or $p$. The minimax rates of estimation
were extensively studied in the fixed design Gaussian regression framework and we refer for instance
to [1] for a detailed discussion. In this section, we apply a classical methodology known as Fano's
Lemma in order to derive minimax rates of estimation for ordered and complete variable selection.
Then, we deduce adaptive properties of the penalized estimator $\widetilde\theta$.



4.1 Adaptivity with respect to ellipsoids 

In this section, we prove that the estimator $\widetilde\theta$ introduced in Section 3.1 to perform ordered variable
selection is adaptive to a large class of ellipsoids.

Definition 4.1. For any non increasing sequence $(a_i)_{1\leq i\leq p+1}$ such that $a_1 = 1$ and $a_{p+1} = 0$ and
any $R > 0$, we define the ellipsoid $\mathcal{E}_a(R)$ by

\[ \mathcal{E}_a(R) := \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{l(\theta_{m_{i-1}},\theta_{m_i})}{a_i^2} \leq R^2\right\} . \]
This definition is very similar to the notion of ellipsoids introduced in [36]. Let us explain why
we call this set an ellipsoid. Assume for one moment that the $(X_i)_{1\leq i\leq p}$ are independent identically
distributed with variance one. In this case, the term $l(\theta_{m_{i-1}},\theta_{m_i})$ equals $\theta_i^2$ and the definition of
$\mathcal{E}_a(R)$ translates into

\[ \mathcal{E}_a(R) = \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{\theta_i^2}{a_i^2} \leq R^2\right\} , \]

which precisely corresponds to a classical definition of an ellipsoid. If the $(X_i)_{1\leq i\leq p}$ are not i.i.d.
with unit variance, it is always possible to create a sequence $X'_i$ of i.i.d. standard Gaussian variables
by orthonormalizing the $X_i$ using the Gram-Schmidt process. If we call $\theta'$ the vector in $\mathbb{R}^p$ such that
$X\theta = X'\theta'$, then it holds that $l(\theta_{m_{i-1}},\theta_{m_i}) = \theta_i'^2$. Then, we can express $\mathcal{E}_a(R)$ using the coordinates
of $\theta'$ as previously:

\[ \mathcal{E}_a(R) = \left\{\theta\in\mathbb{R}^p,\ \sum_{i=1}^p \frac{\theta_i'^2}{a_i^2} \leq R^2\right\} . \]

The main advantage of this definition is that it does not directly depend on the covariance of
$(X_i)_{1\leq i\leq p}$.
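
The identity $l(\theta_{m_{i-1}},\theta_{m_i}) = \theta_i'^2$ can be checked numerically on a small example. The sketch below assumes that the loss is $l(u,v) = (u-v)^*\Sigma(u-v)$, that $m_i = \{1,\ldots,i\}$, and that $\theta_{m_i}$ solves the normal equations of the best linear prediction of $X\theta$ by $X_1,\ldots,X_i$. The Gram-Schmidt step is carried out through the Cholesky factorization $\Sigma = LL^*$, which gives $\theta' = L^*\theta$. The toy covariance matrix and all names are hypothetical.

```python
import math

def cholesky(S):
    """Lower-triangular L with S = L L^T (S symmetric positive definite)."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def projection(S, theta, i):
    """theta_{m_i}: best linear predictor of X theta based on the first i covariates."""
    p = len(theta)
    if i == 0:
        return [0.0] * p
    A = [[S[a][b] for b in range(i)] for a in range(i)]
    rhs = [sum(S[a][b] * theta[b] for b in range(p)) for a in range(i)]
    return solve(A, rhs) + [0.0] * (p - i)

def loss(S, u, v):
    """l(u, v) = (u - v)^T S (u - v)."""
    d = [a - b for a, b in zip(u, v)]
    return sum(d[a] * S[a][b] * d[b] for a in range(len(d)) for b in range(len(d)))

S = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]  # toy covariance matrix
theta = [1.0, -2.0, 0.5]
L = cholesky(S)
# Gram-Schmidt coordinates: X = X' L^T, hence theta' = L^T theta.
theta_prime = [sum(L[k][i] * theta[k] for k in range(3)) for i in range(3)]
increments = [loss(S, projection(S, theta, i - 1), projection(S, theta, i))
              for i in range(1, 4)]
```

Here `increments[i]` and `theta_prime[i] ** 2` agree up to rounding errors, whatever the (positive definite) covariance matrix.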

Proposition 4.1. For any sequence $(a_i)_{1\leq i\leq p}$ and any positive number $R$, the minimax rate of
estimation over the ellipsoid $\mathcal{E}_a(R)$ is lower bounded by

\[ \inf_{\widehat\theta}\sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] \geq L\sup_{1\leq i\leq p}\left[a_i^2R^2\wedge\frac{\sigma^2 i}{n}\right] . \tag{20} \]

This result is analogous to the lower bounds obtained in the fixed design regression framework
(see e.g. [Hj Th. 4.9). Hence, the estimator $\widetilde\theta$ built in Section 3.1 is adaptive to a large class of
ellipsoids.
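
As a small numerical illustration, the sketch below evaluates a lower bound of the form $\sup_{1\leq i\leq p}[a_i^2R^2\wedge\sigma^2 i/n]$, which is the form assumed here for (20) up to the constant $L$. The polynomially decaying sequence $a_i = 1/i$ and the numerical values are hypothetical choices.

```python
def ellipsoid_lower_bound(a, R, sigma2, n):
    """sup over i of min(a_i^2 R^2, sigma^2 i / n), i.e. the bound up to the constant L."""
    return max(min(a_i ** 2 * R ** 2, sigma2 * i / n)
               for i, a_i in enumerate(a, start=1))

p, n, R, sigma2 = 50, 200, 1.0, 1.0
a = [1.0 / i for i in range(1, p + 1)]  # hypothetical decay a_i = 1 / i
bound = ellipsoid_lower_bound(a, R, sigma2, n)
```

The supremum is attained near the index where the bias-type term $a_i^2R^2$ and the variance-type term $\sigma^2 i/n$ balance each other.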

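Corollary 4.2. Assume that $n$ is larger than 12. We consider the penalized estimator $\widetilde\theta$ with the
collection $\mathcal{M}_{\lfloor n/2\rfloor}$ and the penalty $\mathrm{pen}(m) = K\frac{d_m}{n-d_m}$. Let $\mathcal{E}_a(R)$ be an ellipsoid whose radius $R$
satisfies $\frac{\sigma^2}{n} \leq R^2 \leq \sigma^2 n^\beta$ for some $\beta > 0$. Then, $\widetilde\theta$ is approximately minimax on $\mathcal{E}_a(R)$:

\[ \sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widetilde\theta,\theta)\right] \leq L(K,\beta)\inf_{\widehat\theta}\sup_{\theta\in\mathcal{E}_a(R)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] , \]

if either $n \geq 2p$ or $a^2_{\lfloor n/2\rfloor+1}R^2 \leq \sigma^2/2$.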



In the fixed design framework, one may build adaptive estimators to any ellipsoid satisfying
$R^2 \geq \sigma^2/n$ so that the ellipsoid is not degenerate (see e.g. [13] Sect. 4.3.3). In our setting,
when $p$ is small the estimator $\widetilde\theta$ is adaptive to all the ellipsoids that have a moderate radius
$\sigma^2/n \leq R^2 \leq \sigma^2 n^\beta$. The technical condition $R^2 \leq \sigma^2 n^\beta$ is not really restrictive. It comes from
the term $n^3\, l(0_p,\theta)\exp(-nL(K))$ in Theorem 3.1, which goes exponentially fast to 0 with $n$. When
$p$ is larger, $\widetilde\theta$ is adaptive to the ellipsoids that also satisfy $a^2_{\lfloor n/2\rfloor+1}R^2 \leq \sigma^2/2$. In other words,
we require that the ellipsoid is well approximated by the space $S_{m_{\lfloor n/2\rfloor}}$ of vectors $\theta$ whose support
is included in $\{1,\ldots,\lfloor n/2\rfloor\}$. If this condition is not fulfilled, the estimator $\widetilde\theta$ is not proved to be
minimax on $\mathcal{E}_a(R)$. For such situations, we believe on the one hand that the estimator $\widetilde\theta$ should be
refined and on the other hand that our lower bounds are not sharp. Finally, the collection $\mathcal{M}_{\lfloor n/2\rfloor}$
may be replaced by any $\mathcal{M}_{\lfloor n\eta\rfloor}$ in Corollary 4.2.

Since the methods used for minimax lower bounds and the oracle inequalities are analogous to
the ones in the Gaussian sequence framework, one may also adapt to our setting the arguments
developed in [13] Sect. 4.3.5 to derive minimax rates of estimation over other sets such as Besov
bodies. However, this is not really relevant for the regression model (1).



4.2 Adaptivity with respect to sparsity 



Our aim is now to analyze the minimax risk for the complete variable selection problem. Let us
fix an integer $k$ between 1 and $p$. We are interested in estimating the vector $\theta$ within the class of
vectors with at most $k$ non-zero components. This typically corresponds to the situation encountered
in graphical modeling when estimating the neighborhoods of large sparse graphs. As the graph is
assumed to be sparse, only a small number of components of $\theta$ are non-zero.

In the sequel, the set $\Theta[k,p]$ stands for the subset of vectors $\theta \in \mathbb{R}^p$ such that at most $k$
coordinates of $\theta$ are non-zero. For any $r > 0$, we denote by $\Theta[k,p](r)$ the subset of $\Theta[k,p]$ such that
any component of $\theta$ is smaller than $r$ in absolute value.

First, we derive a lower bound for the minimax rates of estimation when the covariates are
independent. Then, we prove that the estimator $\widetilde\theta$ defined with some collection $\mathcal{M}_p^d$ and the penalty
(15) is adaptive to any sparse vector $\theta$. Finally, we investigate the minimax rates of estimation for
correlated covariates.



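Proposition 4.3. Assume that the covariates $X_i$ are independent and have a unit variance. For
any $k \leq p$ and any radius $r > 0$,

\[ \inf_{\widehat\theta}\sup_{\theta\in\Theta[k,p](r)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] \geq Lk\left[r^2\wedge\sigma^2\,\frac{1+\log\left(\frac{p}{k}\right)}{n}\right] . \tag{21} \]

Thanks to Theorem 3.4, we derive the minimax rate of estimation over $\Theta[k,p]$.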



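Corollary 4.4. Consider $K > 1$, $\beta > 0$, and $\eta < \eta(K)$. Assume that $n \geq n_0(K)$ and that the
covariates $X_i$ are independent and have a unit variance. Let $d$ be a positive integer such that $\mathcal{M}_p^d$
satisfies $(\mathbb{H}_{K,\eta})$. The penalized estimator $\widetilde\theta$ defined with the collection $\mathcal{M}_p^d$ and the penalty (15) is
adaptive minimax over the sets $\Theta[k,p](n^\beta)$:

\[ \sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widetilde\theta,\theta)\right] \leq L(K,\beta,\eta)\inf_{\widehat\theta}\sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] , \]

for any $k$ smaller than $d$.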



Hence, the minimax rate of estimation over $\Theta[k,p](n^\beta)$ is of order $\sigma^2 k\,[1+\log(p/k)]/n$, which is similar
to the rates obtained in the fixed design regression framework. As in the previous section, we restrict
ourselves to a radius $r$ in $\Theta[k,p](r)$ smaller than $n^\beta$ because of the term $\tau_n(\mathrm{Var}(Y),K,\eta)$, which
depends on $l(0_p,\theta)$ but goes exponentially fast to 0 when $n$ goes to infinity. Let us interpret Corollary
4.4 with regard to Condition (16). If $p$ is of the same order as $n$, the estimator $\widetilde\theta$ is simultaneously
minimax over all sets $\Theta[k,p](n^\beta)$ when $k$ is smaller than a constant times $n$. If $p$ is much larger
than $n$, the estimator $\widetilde\theta$ is simultaneously minimax over all sets $\Theta[k,p](n^\beta)$ with $k$ smaller than
$L\,n/\log(p)$. We conjecture that the minimax rate of estimation is larger than $k\log(p/k)/n$ when
$k$ becomes larger than $n/\log p$. Let us mention that Tsybakov [HH] has proved general minimax
lower bounds for aggregation in Gaussian random design regression. However, his result does not
apply in our Gaussian design setting since he assumes that the density of the covariates $X_i$
is lower bounded by a constant $\mu_0$.

We have proved that the estimator $\widetilde\theta$ is adaptive to an unknown sparsity when the covariates
are independent. The performance of $\widetilde\theta$ exhibited in Theorem 3.4 does not depend on the covariance
matrix $\Sigma$. Hence, the minimax rate of estimation on $\Theta[k,p]$ is smaller or equal to the order
$k\log(p/k)/n$ for any dependence between the covariates. One may then wonder whether the minimax
rate of estimation over $\Theta[k,p]$ is not faster when the covariates are correlated. We are unable
to derive the minimax rates for a general covariance matrix $\Sigma$. This is why we restrict ourselves
to particular examples of correlation structures. Let us first consider a pathological situation:
assume that $X_1,\ldots,X_k$ are independent and that $X_{k+1},\ldots,X_p$ are all equal to $X_1$. Admittedly, the
covariance matrix $\Sigma$ is then non-invertible. In the discussion, we mention that Theorems 3.1
and 3.4 easily extend when $\Sigma$ is non-invertible if we take into account that the estimators $\widehat\theta_m$ and $\widehat m$
are not necessarily uniquely defined. We may derive from Lemma 2.1 that the estimator $\widehat\theta_{\{1,\ldots,k\}}$
achieves the rate $k/n$ over $\Theta[k,p](n^\beta)$. Conversely, the parametric rate $k/n$ is optimal. However, the
estimator $\widetilde\theta$ defined with the collection $\mathcal{M}_p^d$ and penalty (15) only achieves the rate $k\log(p/k)/n$.

Hence, $\widetilde\theta$ is not minimax over $\Theta[k,p]$ for this particular covariance matrix and the minimax rate is
degenerate. This emergence of faster rates for correlated covariates also occurs for testing problems
in the model (1) as stated in Sect. 4.3. This is why we provide sufficient conditions on $\Sigma$ so
that the minimax rate of estimation is still of the same order as in the independent case. In the
following proposition, $\|.\|$ refers to the canonical norm in $\mathbb{R}^p$.

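Proposition 4.5. Let $\Psi$ denote the correlation matrix of the covariates $(X_i)_{1\leq i\leq p}$. Let $k$ be a
positive number smaller than $p/2$ and let $\delta > 0$. Assume that

\[ (1-\delta)^2\|\theta\|^2 \leq \theta^*\Psi\theta \leq (1+\delta)^2\|\theta\|^2 , \tag{22} \]

for all $\theta \in \mathbb{R}^p$ with at most $2k$ non-zero components. Then, the minimax rate of estimation over
$\Theta[k,p](r)$ is lower bounded as follows

\[ \inf_{\widehat\theta}\sup_{\theta\in\Theta[k,p](r)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] \geq L(1-\delta)^2 k\left[r^2\wedge\sigma^2\,\frac{1+\log\left(\frac{p}{k}\right)}{(1+\delta)^2 n}\right] . \]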



Assumption (22) corresponds to the $\delta$-Restricted Isometry Property of order $2k$ introduced by
Candès and Tao [13]. Under such a condition, the minimax rate of estimation is the same as
the one in the independent case up to a constant depending on $\delta$, and the estimator $\widetilde\theta$ defined in
Corollary 4.4 is still approximately minimax over such sets $\Theta[k,p]$.

However, the $\delta$-Restricted Isometry Property is quite restrictive and does not seem to be necessary
for the minimax rate of estimation to stay of the order $k\log(p/k)/n$. Besides, in many situations
this condition is not fulfilled. Assume for instance that the random vector $X$ is a Gaussian graphical
model with respect to a given sparse graph. We expect that the correlation between two covariates
is large if they are neighbors in the graph and small if they are far-off (w.r.t. the graph distance).
This is why we derive lower bounds on the rate of estimation for correlation matrices often used to
model stationary processes.
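Proposition 4.6. Let $X_1,\ldots,X_p$ form a stationary process on the one dimensional torus. More
precisely, the correlation between $X_i$ and $X_j$ is a function of $|i-j|_p$, where $|.|_p$ refers to the toroidal
distance defined by

\[ |i-j|_p := |i-j| \wedge (p-|i-j|) . \]

$\Psi_1(\omega)$ and $\Psi_2(t)$ respectively refer to the correlation matrices of $X$ such that

\[ \mathrm{corr}(X_i,X_j) := \exp(-\omega|i-j|_p) \quad \text{where } \omega > 0 , \]
\[ \mathrm{corr}(X_i,X_j) := (1+|i-j|_p)^{-t} \quad \text{where } t > 0 . \]

Then, the minimax rates of estimation are lower bounded as follows

\[ \inf_{\widehat\theta}\sup_{\theta\in\Theta[k,p]}\mathbb{E}_{\theta,\Psi_1(\omega)}\left[l(\widehat\theta,\theta)\right] \geq L\,\frac{k\sigma^2}{n}\left[1+\log\left(\frac{\lfloor p/\lceil\log(4k)/\omega\rceil\rfloor}{k}\right)\right] \]

if $k$ is smaller than $p/\lceil\log(4k)/\omega\rceil$, and

\[ \inf_{\widehat\theta}\sup_{\theta\in\Theta[k,p]}\mathbb{E}_{\theta,\Psi_2(t)}\left[l(\widehat\theta,\theta)\right] \geq L\,\frac{k\sigma^2}{n}\left[1+\log\left(\frac{\lfloor p/\lceil(4k)^{1/t}-1\rceil\rfloor}{k}\right)\right] \]

if $k$ is smaller than $p/\lceil(4k)^{1/t}-1\rceil$.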



In the proof of the proposition, we justify that the correlations considered are well-defined at 
least when p is odd. Let us mention that these correlation models are quite classical when modelling 
the correlation of time series (see e.g. 

If the range $\omega$ is larger than $1/p^\gamma$ or if the range $t$ is larger than $\gamma$ for some $\gamma < 1$, the lower
bounds are of order $\sigma^2\frac{k}{n}(1+\log(p/k))$. As a consequence, for any of these correlation models the
minimax rate of estimation is of the same order as the minimax rate of estimation for independent
covariates. This means that the estimator $\widetilde\theta$ defined in Corollary 4.4 is rate-optimal for these
correlation matrices.
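
The two correlation models can be built and checked numerically. The sketch below assumes the toroidal distance $|i-j|_p = |i-j|\wedge(p-|i-j|)$ and uses the fact that a symmetric circulant matrix has eigenvalues given by the cosine transform of its first row. The values $p = 11$, $\omega = 0.5$ and $t = 1$ are hypothetical choices, and the function names are illustrative.

```python
import math

def toroidal_distance(i, j, p):
    """|i - j|_p = min(|i - j|, p - |i - j|)."""
    d = abs(i - j)
    return min(d, p - d)

def correlation_matrix(p, corr):
    """p x p matrix whose (i, j) entry is corr(|i - j|_p)."""
    return [[corr(toroidal_distance(i, j, p)) for j in range(p)] for i in range(p)]

def circulant_eigenvalues(first_row):
    """Eigenvalues of a symmetric circulant matrix, computed from its first row."""
    p = len(first_row)
    return [sum(c * math.cos(2.0 * math.pi * j * d / p) for d, c in enumerate(first_row))
            for j in range(p)]

p, omega, t = 11, 0.5, 1.0  # p odd, as in the case covered by the proof
psi1 = correlation_matrix(p, lambda d: math.exp(-omega * d))
psi2 = correlation_matrix(p, lambda d: (1.0 + d) ** (-t))
min_eig1 = min(circulant_eigenvalues(psi1[0]))
min_eig2 = min(circulant_eigenvalues(psi2[0]))
```

For these values, both smallest eigenvalues are positive, so both matrices are valid correlation matrices.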



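In conclusion, the estimator $\widetilde\theta$ defined in Corollary 4.4 may not be adaptive to the covariance
matrix $\Sigma$ but rather achieves the minimax rate over all covariance matrices $\Sigma$:

\[ \sup_{\Sigma\geq 0}\ \sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widetilde\theta,\theta)\right] \leq L(K,\beta,\eta)\inf_{\widehat\theta}\ \sup_{\Sigma\geq 0}\ \sup_{\theta\in\Theta[k,p](n^\beta)}\mathbb{E}_\theta\left[l(\widehat\theta,\theta)\right] . \]

Nevertheless, the result makes sense if one considers GGMs since the resulting covariates
are typically far from being independent.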



5 Numerical study 

In this section, we carry out a small simulation study to evaluate the performance of our estimator
$\widetilde\theta$. As pointed out earlier, an interesting feature of our criterion lies in its flexibility. However, we
restrict ourselves here to the variable selection problem. Indeed, it allows us to assess the efficiency of
our procedure with regard to the Lasso [34] and the adaptive Lasso proposed by Zou [41]. Even
if these two procedures assume that the conditional variance $\sigma^2$ is known, they give good results
in practice and the comparison with our method is of interest. The calculations are made with R
(www.r-project.org/).
5.1 Simulation scheme

We consider the regression model (1) with $p = 20$ and $\sigma^2 = 1$. The number of observations $n$ equals
15, 20, and 30. We perform two simulation experiments.

1. First simulation experiment: The covariance matrix $\Sigma_1$ is the identity matrix. This
corresponds to the situation where the covariates are all independent. The vector $\theta_1$ has all its
components equal to zero except the three first ones, which respectively equal 2, 1, and 0.5.



2. Second simulation experiment: Let $A$ be the $p \times p$ matrix whose lines $(a_1,\ldots,a_p)$ are
respectively defined by

\[ a_1 := (1,-1,0,\ldots,0)/\sqrt{2} , \]
\[ a_2 := (-1,1.2,0,\ldots,0)/\sqrt{1+1.2^2} , \]
\[ a_3 := (1/\sqrt{2},1/\sqrt{2},1/p,\ldots,1/p)/\sqrt{1/2+(p-2)/p^2} , \]

and for $4 \leq j \leq p$, $a_j$ corresponds to the $j$-th canonical vector of $\mathbb{R}^p$. Then, we take the
covariance matrix $\Sigma_2 = A^*A$ and the vector $\theta_2 = (40, 40, 0, \ldots, 0)$. This choice of parameters
derives from the simulation experiments of Q • Observe that the two first covariates are highly 
correlated. 



For each sample we estimate $\theta$ with our procedure, the Lasso and the adaptive Lasso. For
our procedure we use the collection for n — 15, A4* for n = 20 and, A4p for n — 30. The 
choice of smaller collections for $n = 15$ and 20 is due to Condition (16). We take the penalty
(15) with $K = 1.1$, 1.5, and 2. For the Lasso and adaptive Lasso procedures, we first normalize
the covariates $(X_i)$. Here, $2\sqrt{\log p}\,\sigma$ would be a good choice for the parameter $\lambda$ of the Lasso.
However, we do not have access to $\sigma$. Hence, we use an estimation $\widehat{\mathrm{Var}}(Y)$ of the variance $\mathrm{Var}(Y)$, which
is a (possibly inaccurate) upper bound of $\sigma^2$. This is why we choose the parameter $\lambda$ of the Lasso
between $0.3\times 2\sqrt{\log p\,\widehat{\mathrm{Var}}(Y)}$ and $2\sqrt{\log p\,\widehat{\mathrm{Var}}(Y)}$ by leave-one-out cross-validation. The number
0.3 is rather arbitrary. In practice, the performances of the Lasso do not really depend on this
number as soon as it is neither too small nor close to one. For the adaptive Lasso procedure, the
parameters $\gamma$ and $\lambda$ are also estimated by leave-one-out cross-validation: $\gamma$ can take three
values (0.5, 1, 2) and the values of $\lambda$ vary between $0.3\times 2\sqrt{\log p\,\widehat{\mathrm{Var}}(Y)}$ and $2\sqrt{\log p\,\widehat{\mathrm{Var}}(Y)}$.



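We evaluate the risk ratio

\[ \mathrm{ratio.Risk} := \frac{\mathbb{E}\left[l(\widetilde\theta,\theta)\right]}{\inf_{m\in\mathcal{M}}\mathbb{E}\left[l(\widehat\theta_m,\theta)\right]} \]

as well as the power and the FDR on the basis of 1000 simulations. Here, the power corresponds
to the fraction of non-zero components of $\theta$ estimated as non-zero by the estimator $\widetilde\theta$, while the FDR
is the ratio of the false discoveries over the total number of discoveries:

\[ \mathrm{Power} := \mathbb{E}\left[\frac{\mathrm{Card}(\{i,\ \theta_i\neq 0 \text{ and } \widetilde\theta_i\neq 0\})}{\mathrm{Card}(\{i,\ \theta_i\neq 0\})}\right] \quad \text{and} \quad \mathrm{FDR} := \mathbb{E}\left[\frac{\mathrm{Card}(\{i,\ \theta_i = 0 \text{ and } \widetilde\theta_i\neq 0\})}{\mathrm{Card}(\{i,\ \widetilde\theta_i\neq 0\})}\right] . \]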
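
For a single simulated sample, the two ratios inside these expectations can be computed as in the sketch below. Averaging them over the simulations gives the estimated power and FDR. The function name, the convention that an empty ratio equals 0, and the example estimate are assumptions made for illustration.

```python
def power_and_false_discovery_proportion(theta, theta_hat):
    """Return (detected fraction of the true support, false fraction of the selected support)."""
    true_support = {i for i, v in enumerate(theta) if v != 0}
    selected = {i for i, v in enumerate(theta_hat) if v != 0}
    # Convention (assumption): a ratio with an empty denominator is set to 0.
    power = len(true_support & selected) / len(true_support) if true_support else 0.0
    fdp = len(selected - true_support) / len(selected) if selected else 0.0
    return power, fdp

theta = [2.0, 1.0, 0.5] + [0.0] * 17           # theta_1 of the first experiment (p = 20)
theta_hat = [1.9, 0.8, 0.0, 0.3] + [0.0] * 16  # hypothetical estimate
power, fdp = power_and_false_discovery_proportion(theta, theta_hat)
```

In this example two of the three non-zero components are detected and one of the three selected components is a false discovery.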
                         n = 15                                       n = 20
Estimator    ratio.Risk   Power         FDR            ratio.Risk   Power         FDR
K = 1.1      4.8 ± 0.4    0.67 ± 0.02   0.23 ± 0.02    4.8 ± 0.3    0.77 ± 0.01   0.28 ± 0.02
K = 1.5      5.7 ± 0.4    0.62 ± 0.02   0.20 ± 0.01    5.3 ± 0.4    0.74 ± 0.02   0.25 ± 0.01
K = 2        7.3 ± 0.5    0.54 ± 0.02   0.17 ± 0.01    6.6 ± 0.5    0.68 ± 0.02   0.21 ± 0.01
Lasso        5.8 ± 0.2    0.64 ± 0.01   0.29 ± 0.02    6.0 ± 0.2    0.74 ± 0.01   0.23 ± 0.01
A. Lasso     4.8 ± 0.3    0.64 ± 0.02   0.30 ± 0.02    4.7 ± 0.4    0.75 ± 0.02   0.30 ± 0.01

                         n = 30
Estimator    ratio.Risk   Power         FDR
K = 1.1      4.2 ± 0.3    0.87 ± 0.01   0.23 ± 0.02
K = 1.5      4.1 ± 0.2    0.84 ± 0.01   0.19 ± 0.01
K = 2        4.3 ± 0.2    0.81 ± 0.01   0.14 ± 0.01
Lasso        6.6 ± 0.2    0.83 ± 0.01   0.18 ± 0.01
A. Lasso     4.3 ± 0.5    0.86 ± 0.02   0.26 ± 0.01

Table 1: Our procedure with K = 1.1, 1.5, and 2 and Lasso and adaptive Lasso procedures:
Estimation and 95% confidence interval of Risk ratio (ratio.Risk), Power and FDR when $p = 20$,
$\Sigma = \Sigma_1$, $\theta = \theta_1$, and $n = 15$, 20, and 30.





                         n = 15                                       n = 20
Estimator    ratio.Risk   Power         FDR            ratio.Risk   Power         FDR
K = 1.1      5.3 ± 0.4    0.77 ± 0.03   0.41 ± 0.02    6.4 ± 0.5    0.87 ± 0.02   0.39 ± 0.02
K = 1.5      5.3 ± 0.4    0.76 ± 0.03   0.41 ± 0.02    5.9 ± 0.5    0.87 ± 0.02   0.36 ± 0.02
K = 2        5.5 ± 0.5    0.75 ± 0.03   0.40 ± 0.02    5.5 ± 0.5    0.86 ± 0.02   0.33 ± 0.02
Lasso        13.5 ± 0.3   0.02 ± 0.01   0.99 ± 0.01    16.7 ± 0.3   0.02 ± 0.01   0.98 ± 0.01
A. Lasso     15.0 ± 1.2   0.02 ± 0.01   0.90 ± 0.02    20.5 ± 1.8   0.04 ± 0.01   0.89 ± 0.02

                         n = 30
Estimator    ratio.Risk   Power         FDR
K = 1.1      4.5 ± 0.3    0.96 ± 0.02   0.24 ± 0.02
K = 1.5      3.9 ± 0.3    0.95 ± 0.01   0.19 ± 0.02
K = 2        3.5 ± 0.3    0.94 ± 0.01   0.16 ± 0.02
Lasso        22.0 ± 0.3   0.02 ± 0.01   0.99 ± 0.01
A. Lasso     31.8 ± 3.0   0.04 ± 0.01   0.88 ± 0.02

Table 2: Our procedure with K = 1.1, 1.5, and 2 and Lasso and adaptive Lasso procedures:
Estimation and 95% confidence interval of Risk ratio (ratio.Risk), Power and FDR when $p = 20$,
$\Sigma = \Sigma_2$, $\theta = \theta_2$, and $n = 15$, 20, and 30.

5.2 Results 

The results of the first simulation experiment are given in Table 1. We observe that the five estimators
perform more or less similarly, as expected by the theory. The results of the second simulation
study are reported in Table 2. Clearly, the Lasso and adaptive Lasso procedures are not consistent
in this situation since the power is close to 0 and the FDR is close to one. Consequently, the risk
ratio is quite large and the adaptive Lasso even seems unstable. In contrast, our method exhibits
a large power and a reasonable FDR.

In the two studies, choosing a larger $K$ reduces the power of the estimator but also decreases
the FDR. It seems that the choice $K = 1.1$ yields a good risk ratio, whereas $K = 2$ gives a better
control of the FDR. Contrary to the parameter $\lambda$ for the Lasso, we do not need an ad-hoc method
such as cross-validation to calibrate $K$. The second example is certainly quite pathological but it
illustrates that our estimator performs well even when the Lasso does not provide an accurate
estimation. The good behavior of our method illustrates the strength of Theorem [331 that does not 
depend on the correlation of the explanatory variables. 

6 Discussion and concluding remarks 

Until now, we have assumed that the covariance matrix $\Sigma$ of the covariates is non-singular. If $\Sigma$
is singular, the estimators $\widehat\theta_m$ and the model $\widehat m$ are not necessarily uniquely defined. However,
upon defining $\widehat\theta_m$ as one of the minimizers of $\gamma_n(\theta')$ over $S_m$, one may readily extend the oracle
inequalities stated in Theorems 3.1 and 3.4.

Let us recall the main features of our method. We have defined a model selection criterion that
satisfies oracle inequalities regardless of the correlation between the covariates and regardless of the
collection of models. Hence, the estimator $\widetilde\theta$ achieves nice adaptive properties for ordered variable
selection or for complete variable selection. Besides, one can easily combine this method with prior
knowledge on the model by choosing a proper collection $\mathcal{M}$ or by modulating the penalty pen(.).
Moreover, we may easily calibrate the penalty even when $\sigma^2$ is unknown, whereas the Lasso-type
procedures require a cross-validation strategy to choose the parameter $\lambda$. The price to pay for
these nice properties is a computational cost that depends linearly on the size of $\mathcal{M}$. For
complete variable selection, the size of the collection grows exponentially with $p$ (the underlying
problem is NP-hard), which makes the procedure intractable when $p$ becomes too
large (i.e. more than 50). In contrast, our criterion applies for arbitrary $p$ when considering
ordered variable selection since the size of $\mathcal{M}$ is linear in $n$. In situations where one has a good
prior knowledge on the true model, the collection $\mathcal{M}$ is not too large and our criterion is also
fast to compute even for large $p$.

For complete variable selection, Lasso-type procedures are computationally feasible even when
$p$ is large and achieve oracle inequalities under assumptions on the covariance structure. However,
there are both theoretical and practical problems with these estimators. On the one hand, they
are known to perform poorly for some covariance structures. On the other hand, there is some
room for improvement in the practical calibration of the Lasso, especially when $\sigma^2$ is unknown. In
future work, we would like to combine the strength of our method with these computationally fast
algorithms. The problem at hand is to design a fast data-driven method that picks a subcollection
$\widehat{\mathcal{M}}$ of reasonable size. Afterwards, one applies our procedure to $\widehat{\mathcal{M}}$ instead of $\mathcal{M}$. A direction that
needs further investigation is taking for $\widehat{\mathcal{M}}$ all the subsets of the regularization path given by the
Lasso.
7 Proofs

7.1 Some notations and probabilistic tools 

First, let us define the random variable $\epsilon_m$ by

\[ Y = X\theta_m + \epsilon_m + \epsilon \quad \text{a.s.} \tag{23} \]

By definition of $\theta_m$, $\epsilon_m$ follows a normal distribution and is independent of $\epsilon$ and of $X_m$. Hence,
the variance of $\epsilon_m$ equals $l(\theta_m,\theta)$. The vectors $\boldsymbol{\epsilon}$ and $\boldsymbol{\epsilon}_m$ refer to the $n$ samples of $\epsilon$ and $\epsilon_m$.
For any model $m$ and any vector $Z$ of size $n$, $\Pi_m^\perp Z$ stands for $Z - \Pi_m Z$. For any subset $m$ of
$\{1,\ldots,p\}$, $\Sigma_m$ denotes the covariance matrix of the vector $X_m$. Moreover, we define the row vector
$Z_m := X_m\sqrt{\Sigma_m^{-1}}$ in order to deal with standard Gaussian vectors. Similarly to the matrix $\mathbf{X}_m$, the
$n \times d_m$ matrix $\mathbf{Z}_m$ stands for the $n$ observations of $Z_m$. The notation $\langle .,.\rangle_n$ refers to the empirical
inner product associated with the norm $\|.\|_n$. Lastly, $\varphi_{\max}(A)$ denotes the largest eigenvalue (in
absolute value) of a symmetric square matrix $A$.

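We shall extensively use the explicit expression of $\widehat\theta_m$:

\[ \mathbf{X}\widehat\theta_m = \mathbf{X}_m(\mathbf{X}_m^*\mathbf{X}_m)^{-1}\mathbf{X}_m^*\mathbf{Y} . \tag{24} \]

Let us state a first lemma that gives the expressions of $\gamma_n(\widehat\theta_m)$, $\gamma(\widehat\theta_m)$, and the loss $l(\widehat\theta_m,\theta_m)$.
Lemma 7.1. For any model $m$ of size smaller than $n$,

\[ \gamma_n(\widehat\theta_m) = \|\Pi_m^\perp(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)\|_n^2 , \tag{25} \]
\[ \gamma(\widehat\theta_m) = \sigma^2 + l(\theta_m,\theta) + l(\widehat\theta_m,\theta_m) , \tag{26} \]
\[ l(\widehat\theta_m,\theta_m) = (\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m)^*\mathbf{Z}_m(\mathbf{Z}_m^*\mathbf{Z}_m)^{-2}\mathbf{Z}_m^*(\boldsymbol{\epsilon}+\boldsymbol{\epsilon}_m) . \tag{27} \]

The proof is postponed to the Appendix.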



We now introduce the main probabilistic tools used throughout the proofs. First, we need to
bound the deviations of $\chi^2$ random variables.

Lemma 7.2. For any integer $d > 0$ and any positive number $x$,

\[ \mathbb{P}\left(\chi^2(d) \leq d - 2\sqrt{dx}\right) \leq \exp(-x) , \]
\[ \mathbb{P}\left(\chi^2(d) \geq d + 2\sqrt{dx} + 2x\right) \leq \exp(-x) . \]

These bounds are classical and are shown by applying the Laplace method. We refer to Lemma
1 in [22] for more details. Moreover, we state a refined bound for the lower deviations of a $\chi^2$
distribution.
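
These deviation bounds can be illustrated by a quick Monte Carlo experiment. The sketch below assumes the thresholds $d - 2\sqrt{dx}$ and $d + 2\sqrt{dx} + 2x$ for the lower and upper deviations, and compares the empirical tail frequencies with $\exp(-x)$. The values $d = 5$ and $x = 1$, the number of draws and the function names are hypothetical.

```python
import math
import random

def chi2_sample(d, rng):
    """One draw of a chi-square variable with d degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))

def empirical_tails(d, x, n_draws=20000, seed=0):
    """Empirical frequencies of the lower and upper deviation events."""
    rng = random.Random(seed)
    lower = d - 2.0 * math.sqrt(d * x)
    upper = d + 2.0 * math.sqrt(d * x) + 2.0 * x
    draws = [chi2_sample(d, rng) for _ in range(n_draws)]
    p_lower = sum(v <= lower for v in draws) / n_draws
    p_upper = sum(v >= upper for v in draws) / n_draws
    return p_lower, p_upper

p_lower, p_upper = empirical_tails(d=5, x=1.0)
bound = math.exp(-1.0)  # both frequencies should stay below this value
```

In such experiments the empirical frequencies are typically well below $\exp(-x)$, which reflects that the bounds are not tight for small $d$.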



Lemma 7.3. For any integer $d > 0$ and any positive number $x$,

\[ \mathbb{P}\left(\chi^2(d) \leq d\left[\left(1-\delta_d-\sqrt{\frac{2x}{d}}\right)\vee 0\right]^2\right) \leq \exp(-x) , \]

\[ \text{where } \delta_d := \sqrt{\frac{\pi}{2d}} + \exp(-d/16) . \tag{28} \]



The proof is postponed to the Appendix. Finally, we shall bound the largest eigenvalue of standard
Wishart matrices and standard inverse Wishart matrices. The following deviation inequality is taken
from Theorem 2.13 in [17].

Lemma 7.4. Let $Z^*Z$ be a standard Wishart matrix of parameters $(n,d)$ with $n > d$. For any
positive number $x$,

\[ \mathbb{P}\left\{\varphi_{\max}\left[(Z^*Z)^{-1}\right] \geq \left[n\left(1-\sqrt{\frac{d}{n}}-x\right)^2\right]^{-1}\right\} \leq \exp(-nx^2/2) \]

and

\[ \mathbb{P}\left\{\varphi_{\max}(Z^*Z) \geq n\left(1+\sqrt{\frac{d}{n}}+x\right)^2\right\} \leq \exp(-nx^2/2) . \]



7.2 Proof of Theorem [331 

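Proof of Theorem 3.1. For the sake of simplicity, we divide the main steps of the proof into several
lemmas. First, let us fix a model $m$ in the collection $\mathcal{M}$. By definition of $\widehat m$, we know that

\[ \gamma_n(\widetilde\theta)\left[1+\mathrm{pen}(\widehat m)\right] \leq \gamma_n(\widehat\theta_m)\left[1+\mathrm{pen}(m)\right] . \]

Subtracting $\gamma(\theta)$ from both sides of this inequality and using $\gamma_n(\widehat\theta_m) \leq \gamma_n(\theta_m)$ yields

\[ l(\widetilde\theta,\theta) \leq l(\theta_m,\theta) + \gamma_n(\widehat\theta_m)\mathrm{pen}(m) + \overline\gamma_n(\theta_m) - \gamma_n(\widetilde\theta)\mathrm{pen}(\widehat m) - \overline\gamma_n(\widetilde\theta) , \tag{29} \]

where $\overline\gamma_n(.) := \gamma_n(.) - \gamma(.)$. The proof is based on the concentration of the term $-\overline\gamma_n(\widetilde\theta)$. More
precisely, we shall prove that with overwhelming probability this quantity is of the same order as
the penalty term $\gamma_n(\widetilde\theta)\mathrm{pen}(\widehat m)$.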

Let $\kappa_1$ and $\kappa_2$ be two positive numbers smaller than one that we shall fix later. For any model
$m' \in \mathcal{M}$, we introduce the random variables $A_{m'}$ and $B_{m'}$ as



Bn 



Ki + l 

K- d - 



in- 1 e -II 2 



K 2 n<p n 

m')\\ n 



K- 



n-d m i l(6 m ,,9)+a 2 

-1 (n m /£, H m ,e m i) n ||ri T , 

d m > 11^,(6 + e m US 



- d 7l 



n 2 ntp n 



|II m (e + e m <)\\l 



|IT m /(e + e. 



m' )\\n 



l(9 m ,,9) + a 2 



(30) 



(31) 



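We recall that the notations $\epsilon_m$, $Z_m$, $\langle\cdot,\cdot\rangle_n$, and $\varphi_{\max}(\cdot)$ are defined in Section 7.1. We may upper bound the expression $-\bar\gamma_n(\hat\theta) - \gamma_n(\hat\theta)\,pen(\hat m)$ with respect to $A_{\hat m}$ and $B_{\hat m}$ as follows.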
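Lemma 7.5. Almost surely, it holds that

$$-\bar\gamma_n(\hat\theta) - \gamma_n(\hat\theta)\,pen(\hat m) - \sigma^2 + \|\epsilon\|_n^2 \le l(\hat\theta,\theta)\left[A_{\hat m} \vee (1-\kappa_2)\right] + \sigma^2 B_{\hat m}. \qquad (32)$$

Let us set the constants

$$\kappa_1 := \frac{1}{4} \quad\text{and}\quad \kappa_2 := \frac{(K-1)(1-\sqrt{\eta})^2}{16} \wedge 1. \qquad (33)$$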



We do not claim that this choice is optimal, but we are not really concerned about the constants for this result. The core of this proof consists in showing that with overwhelming probability the variable $A_{\hat m}$ is smaller than 1 and $B_{\hat m}$ is smaller than a constant over $n$.



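Lemma 7.6. The event $\Omega_1$ defined as

$$\Omega_1 := \left\{A_{\hat m} \le \frac{7}{8}\right\} \cap \left\{\kappa_2\, n\,\varphi_{\max}\left[(Z^*_{\hat m}Z_{\hat m})^{-1}\right] \le \frac{K-1}{4}\right\}$$

satisfies $\mathbb{P}(\Omega_1^c) \le L\,\mathrm{Card}(\mathcal{M})\exp[-nL'(K,\eta)]$, where $L'(K,\eta)$ is positive.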

Lemma 7.7. There exists an event $\Omega_2$ of probability larger than $1 - \exp(-nL)$ with $L > 0$ such that

$$\mathbb{E}\left[B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}\right] \le \frac{L(K,\eta,\alpha,\beta)}{n}.$$



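Gathering the upper bound (29) and Lemmas 7.5, 7.6 and 7.7, we conclude that

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\right]\left(\kappa_2 \wedge \frac{1}{8}\right) \le l(\theta_m,\theta) + \mathbb{E}\left[\gamma_n(\hat\theta_m)\,pen(m)\right] + \sigma^2\frac{L(K,\eta,\alpha,\beta)}{n} + \mathbb{E}\left[\mathbf{1}_{\Omega_1\cap\Omega_2}\left(\bar\gamma_n(\theta_m) + \sigma^2 - \|\epsilon\|_n^2\right)\right].$$

As the expectation of the random variable $\bar\gamma_n(\theta_m) + \sigma^2 - \|\epsilon\|_n^2$ is zero, it holds that

$$\left|\mathbb{E}\left[\mathbf{1}_{\Omega_1\cap\Omega_2}\left(\bar\gamma_n(\theta_m) + \sigma^2 - \|\epsilon\|_n^2\right)\right]\right| = \left|\mathbb{E}\left[\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\left(\bar\gamma_n(\theta_m) + \sigma^2 - \|\epsilon\|_n^2\right)\right]\right|$$

$$\le \sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\left[\sqrt{\mathbb{E}\left[\|\epsilon_m\|_n^2 - l(\theta_m,\theta)\right]^2} + 2\sqrt{\mathbb{E}\langle\epsilon,\epsilon_m\rangle_n^2}\right]$$

$$\le \sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\,\sqrt{\frac{2}{n}}\left[l(\theta_m,\theta) + \sigma\sqrt{2\,l(\theta_m,\theta)}\right].$$

The probabilities $\mathbb{P}(\Omega_1^c)$ and $\mathbb{P}(\Omega_2^c)$ converge to 0 at an exponential rate with respect to $n$. Hence, by taking the infimum over all the models $m \in \mathcal{M}$, we obtain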



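$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\right] \le L(K,\eta)\inf_{m\in\mathcal{M}}\left[l(\theta_m,\theta) + \left(\sigma^2 + l(\theta_m,\theta)\right)pen(m)\right] + L_2(K,\eta,\alpha,\beta)\frac{\sigma^2}{n} + L_3(K,\eta)\sqrt{\mathrm{Card}(\mathcal{M})}\left[\sigma^2 + l(0_p,\theta)\right]\exp[-nL_4(K,\eta)], \qquad (34)$$

with $L_4(K,\eta) > 0$. In order to conclude, we need to control the loss of the estimator $\hat\theta$ on the event of small probability $\Omega_1^c \cup \Omega_2^c$. Thanks to the following proposition, we may upper bound the $r$-th risk of the estimators $\hat\theta_m$.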
Proposition 7.8. For any model $m$ and any integer $r \ge 2$ such that $n - d_m - 2r + 1 > 0$,

$$\mathbb{E}\left[l(\hat\theta_m,\theta_m)^r\right]^{1/r} \le L\,r\,d_m\,n\left[\sigma^2 + l(\theta_m,\theta)\right].$$



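The proof is postponed to Section 7.4. We derive from this bound a strong control on $\mathbb{E}[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}]$.

Lemma 7.9.

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\right] \le L(K,\eta)\,n^2\,\mathrm{Card}(\mathcal{M})\,\mathrm{Var}(Y)\exp[-nL'(K,\eta)], \qquad (35)$$

where $L'(K,\eta)$ is positive.

By Assumptions $(\mathbb{H}_{Pol})$ and $(\mathbb{H}_\eta)$, the cardinality of the collection $\mathcal{M}$ is smaller than $\alpha n^{1+\beta}$. We gather the upper bounds (34) and (35) and so we conclude. □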



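Proof of Lemma 7.5. Thanks to Lemma 7.1, we decompose $\bar\gamma_n(\hat\theta)$ as

$$\bar\gamma_n(\hat\theta) = \|\Pi^\perp_{\hat m}(\epsilon+\epsilon_{\hat m})\|_n^2 - \sigma^2 - l(\theta_{\hat m},\theta) - (1-\kappa_2)\,l(\hat\theta,\theta_{\hat m}) - \kappa_2(\epsilon+\epsilon_{\hat m})^*Z_{\hat m}(Z^*_{\hat m}Z_{\hat m})^{-2}Z^*_{\hat m}(\epsilon+\epsilon_{\hat m}).$$

Since $2ab \le \kappa_1 a^2 + \kappa_1^{-1}b^2$ for any $\kappa_1 > 0$, it holds that

$$-\|\Pi^\perp_{\hat m}(\epsilon+\epsilon_{\hat m})\|_n^2 + \|\epsilon\|_n^2 \le \|\Pi_{\hat m}\epsilon\|_n^2 + l(\theta_{\hat m},\theta)\left[\kappa_1 - \frac{\|\Pi^\perp_{\hat m}\epsilon_{\hat m}\|_n^2}{l(\theta_{\hat m},\theta)}\right] + \kappa_1^{-1}\sigma^2\frac{\langle\Pi^\perp_{\hat m}\epsilon,\Pi^\perp_{\hat m}\epsilon_{\hat m}\rangle_n^2}{\sigma^2\,l(\theta_{\hat m},\theta)}.$$

Besides, we upper bound Expression (27) of $l(\hat\theta,\theta_{\hat m})$ using the largest eigenvalue of $(Z^*_{\hat m}Z_{\hat m})^{-1}$:

$$(\epsilon+\epsilon_{\hat m})^*Z_{\hat m}(Z^*_{\hat m}Z_{\hat m})^{-2}Z^*_{\hat m}(\epsilon+\epsilon_{\hat m}) \le \varphi_{\max}\left[(Z^*_{\hat m}Z_{\hat m})^{-1}\right](\epsilon+\epsilon_{\hat m})^*Z_{\hat m}(Z^*_{\hat m}Z_{\hat m})^{-1}Z^*_{\hat m}(\epsilon+\epsilon_{\hat m})$$

$$= \left[\sigma^2 + l(\theta_{\hat m},\theta)\right]n\,\varphi_{\max}\left[(Z^*_{\hat m}Z_{\hat m})^{-1}\right]\frac{\|\Pi_{\hat m}(\epsilon+\epsilon_{\hat m})\|_n^2}{\sigma^2 + l(\theta_{\hat m},\theta)}. \qquad (36)$$

Thanks to the assumption on the penalty, we upper bound the penalty term as follows:

$$-\gamma_n(\hat\theta)\,pen(\hat m) \le -\left[\sigma^2 + l(\theta_{\hat m},\theta)\right]K\frac{d_{\hat m}}{n-d_{\hat m}}\,\frac{\|\Pi^\perp_{\hat m}(\epsilon+\epsilon_{\hat m})\|_n^2}{\sigma^2 + l(\theta_{\hat m},\theta)}.$$

By gathering the last four bounds, we get

$$-\bar\gamma_n(\hat\theta) - \gamma_n(\hat\theta)\,pen(\hat m) - \sigma^2 + \|\epsilon\|_n^2 \le l(\hat\theta,\theta)\left[A_{\hat m} \vee (1-\kappa_2)\right] + \sigma^2B_{\hat m},$$

since $l(\hat\theta,\theta)$ decomposes into the sum $l(\hat\theta,\theta_{\hat m}) + l(\theta_{\hat m},\theta)$.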
□

Proof of Lemma 7.6. We recall that for any model $m \in \mathcal{M}$,

$$A_m = \frac{5}{4} - \frac{\|\Pi^\perp_m\epsilon_m\|_n^2}{l(\theta_m,\theta)} + \kappa_2\,n\,\varphi_{\max}\left[(Z^*_mZ_m)^{-1}\right]\frac{\|\Pi_m(\epsilon+\epsilon_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2} - K\frac{d_m}{n-d_m}\,\frac{\|\Pi^\perp_m(\epsilon+\epsilon_m)\|_n^2}{l(\theta_m,\theta)+\sigma^2}.$$

In order to control the variable $A_{\hat m}$, we shall simultaneously bound the deviations of the four random variables involved in any variable $A_m$.

Since $X_m$ is independent of $\epsilon_m/\sqrt{l(\theta_m,\theta)}$ and since $\epsilon_m/\sqrt{l(\theta_m,\theta)}$ is a standard Gaussian vector of size $n$, the random variable $n\|\Pi^\perp_m\epsilon_m\|_n^2/l(\theta_m,\theta)$ follows a $\chi^2$ distribution with $n - d_m$ degrees of freedom conditionally on $X_m$. As this distribution does not depend on $X_m$, $n\|\Pi^\perp_m\epsilon_m\|_n^2/l(\theta_m,\theta)$ follows a $\chi^2$ distribution with $n - d_m$ degrees of freedom. Similarly, the random variables $n\|\Pi_m(\epsilon+\epsilon_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ and $n\|\Pi^\perp_m(\epsilon+\epsilon_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ follow $\chi^2$ distributions with respectively $d_m$ and $n - d_m$ degrees of freedom. Besides, the matrix $(Z^*_mZ_m)$ follows a standard Wishart distribution with parameters $(n,d_m)$.
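The degrees-of-freedom statement above can be illustrated by simulation. The following is a minimal sketch, assuming a design with i.i.d. standard Gaussian entries drawn independently of a standard Gaussian noise vector; the function names are hypothetical. It projects the noise on the orthogonal complement of the span of $d$ design columns and checks that the average squared Euclidean norm of the residual is close to $n - d$, the mean of a $\chi^2(n-d)$ variable.

```python
import random

def residual_sq_norm(columns, y):
    """Squared Euclidean norm of y minus its projection onto span(columns)."""
    basis = []
    for col in columns:
        v = list(col)
        # Gram-Schmidt: remove the components along the previous basis vectors
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    r = list(y)
    for q in basis:
        coef = sum(a * b for a, b in zip(r, q))
        r = [a - coef * b for a, b in zip(r, q)]
    return sum(a * a for a in r)

def mean_residual(n, d, reps, seed=0):
    """Monte Carlo average of the residual squared norm over `reps` draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        design = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
        noise = [rng.gauss(0, 1) for _ in range(n)]
        total += residual_sq_norm(design, noise)
    return total / reps

if __name__ == "__main__":
    print(mean_residual(n=12, d=3, reps=2000))  # should be close to n - d = 9
```

The key point is that the conditional law of the residual does not depend on the design, which is why the unconditional law is also a $\chi^2$ with $n-d$ degrees of freedom.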

Let $x$ be a positive number we shall fix later. By Lemmas 7.3 and 7.4, there exists an event $\Omega_1'$ of large probability

$$\mathbb{P}(\Omega_1'^c) \le 4\exp(-nx)\,\mathrm{Card}(\mathcal{M}),$$
such that conditionally on $\Omega_1'$,



ITT- l f II 2 

11 m t "i II n 



l{&m, 0) 

|n m (e + e 



2 

m)\\ n 



<J 2 + l(0 m ,9) 



~-rn )\\' n 



|n,i(e 



9n 



<r 2 + l(8 m ,0) 
(Z* n Z m ) 



> 



< 



> 



< 



n 

d"m, 

n 

n — d, t 
n 



(n - d m )x 



- 2 



(n - d m )x 



2x V 



(37) 
(38) 
(39) 

(40) 



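for every model $m \in \mathcal{M}$. Let us prove that for a suitable choice of the number $x$, $A_{\hat m}\mathbf{1}_{\Omega_1'}$ is smaller than $7/8$. First, we constrain $\kappa_2\,n\,\varphi_{\max}[(Z^*_{\hat m}Z_{\hat m})^{-1}]$ to be smaller than $\frac{K-1}{4}$ on the event $\Omega_1'$. By (40), it holds that

$$n\,\varphi_{\max}\left[(Z^*_{\hat m}Z_{\hat m})^{-1}\right] \le \left[\left(1 - \sqrt{\eta} - \sqrt{2x}\right) \vee 0\right]^{-2}.$$

Constraining $x$ to be smaller than $\frac{(1-\sqrt{\eta})^2}{8}$ ensures that the largest eigenvalue of $(Z^*_{\hat m}Z_{\hat m})^{-1}$ satisfies

$$n\,\varphi_{\max}\left[(Z^*_{\hat m}Z_{\hat m})^{-1}\right] \le \frac{4}{(1-\sqrt{\eta})^2}.$$

By definition (33) of $\kappa_2$, it follows that $\kappa_2\,n\,\varphi_{\max}[(Z^*_{\hat m}Z_{\hat m})^{-1}] \le (K-1)/4$. Applying inequality $2ab \le \delta a^2 + \delta^{-1}b^2$ to the bounds (37), (38), and (39) yields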



n 2 nip n 



(ZffiZa) 
-A' 



inie-11 2 



1(6 rh, 8 
\U ffl (e + e ffl )\ 



o 2 + l(0 m ,9) 
dfh ||ni(e + effi )|| 2 



n — dm a 2 + l{9fn,0) 



< 



: K — 1 



1 . 

2 ~2n~ 

d 



2x 



3x 

~ + Y 

dfft 2Kij 

< —K hi 

2n 1 — 7} 
Gathering these three inequalities, we get



Afhln^ < j + x 



„ 3(if — 1) nTr 77 



1-r, 



If we set £ to 



8(2 + ^W' 



1-7? 



then $A_{\hat m}\mathbf{1}_{\Omega_1'}$ is smaller than $7/8$ and the result follows.



□ 



Proof of Lemma 7.7. We shall simultaneously bound the deviations of the random variables involved in the definition of $B_m$ for all models $m \in \mathcal{M}$. Let us first define the random variable $E_m$



as 



Factorizing by the norm of e, we get 



-l (n m e, n m e 



2 



n er 

|- LJ -m c |ln 



a 2 l(6 mi 9) 



-iNln<p£§T' n ™ em >« , lin m e||2 



1 ,72 



(41) 



The variable $n\|\epsilon\|_n^2/\sigma^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom. By Lemma 7.2 there exists an event $\Omega_2$ of probability larger than $1 - \exp(-n/8)$ such that $\|\epsilon\|_n^2/\sigma^2$ is smaller than 2. As $\kappa_1^{-1} = 4$,



we obtain 



E m 1 



< 



/ ^tji c TT-L \2 



Since e, e m , and X m are independent, it holds that conditionally on X m and e, 



|n J -e||„ ' '-'"m' z m/n 



x 2 (i) 



Since the distribution depends neither on X m nor on e, this random variable follows a x 2 distribution 
with 1 degree of freedom. Besides, it is independent of the variable iE^lLa. Arguing as previously, 
we work out the distribution 



Consequently, the variable E m ln 2 is upper bounded by a random variable that follows the distri- 
bution of 

-Tr + l -T 2 , 

n n 

where $T_1$ and $T_2$ are two independent $\chi^2$ random variables with respectively 1 and $d_m$ degrees of freedom. Moreover, the random variables $n\|\Pi_m(\epsilon+\epsilon_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ and $n\|\Pi^\perp_m(\epsilon+\epsilon_m)\|_n^2/[l(\theta_m,\theta)+\sigma^2]$ respectively follow a $\chi^2$ distribution with $d_m$ and $n - d_m$ degrees of freedom.
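The event on which $\|\epsilon\|_n^2/\sigma^2$ is smaller than 2 can be recovered by a short computation. The worked step below assumes that the upper deviation bound invoked there is the classical one, $\mathbb{P}(\chi^2(n) \ge n + 2\sqrt{nx} + 2x) \le e^{-x}$, and applies it with $x = n/8$:

```latex
n + 2\sqrt{n\cdot\frac{n}{8}} + 2\cdot\frac{n}{8}
  = n\left(1 + \frac{1}{\sqrt{2}} + \frac{1}{4}\right)
  \approx 1.957\,n \le 2n ,
\qquad\text{hence}\qquad
\mathbb{P}\left(\frac{\|\epsilon\|_n^2}{\sigma^2} \ge 2\right)
  = \mathbb{P}\left(\chi^2(n) \ge 2n\right) \le \exp(-n/8).
```

This is consistent with an exceptional event of probability at most $\exp(-n/8)$.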



Let us bound the deviations of the random variables E m l< 



||n m ( e + em )|L 



and 



|n4( e +e m )||; 
1(0)+^ 



for 



any model $m \in \mathcal{M}$. We apply Lemma 1 in [22] for $E_m\mathbf{1}_{\Omega_2}$ and Lemma 7.2 for the two remaining random variables. Hence, for any $x > 0$, there exists an event $F(x)$ of large probability



P [¥(x) c ] < 



E « 

W6M 



< 



+ e 



such that conditionally on ¥(x), 



\\n m (e+e m )\\ 2 n 

Kd m ||(n^e+e m )|| 
n-d m <r 2 +l(9 m ,6) 



< 
< 

< 



^ + Isf &n + 8 2 ] ((jid— Txj + 16^±S 
i (d TO + 2y/d m [d m £ 2 + x] + 2 [d m £ 2 + a:)) 
-dm - 2yJ{n - d m )(£ 3 d m + x) 



for all models $m \in \mathcal{M}$. We shall fix later the positive constants $\xi_1$, $\xi_2$, and $\xi_3$. Let us apply extensively the inequality $2ab \le \tau a^2 + \tau^{-1}b^2$. Hence, conditionally on $F(x)$, the model $\hat m$ satisfies



l(8^M)+cr 2 
Kd^ l |n4(e+e ffi )||; 

<T 2 +i(e ?K ,e) 



d~ 



< ^[l + 2V!i"+176+n] + ^[l7H 

< [i + 2V6 + 2e 2 + r 2 ] +f [2 

< 



72 



1-2,/e 



K — T. 

n ■ 



-1 4 



is smaller than ^j^- 



By Lemma EH we know that conditionally on fix, K^rup-n 

By assumption (H^), the ratio n l*g_ is smaller than j^-. Gathering these inequalities we upper 
bound Bfn on the event fii n 2 n ¥(x), 



s™< — U + -V + - 

n n n 



where U and V are defined as 
U := 1 + 2^ 

V := n + r^ + X^l [2 



176 + n 
if - 1 



if-lr 



1 + 2V6 + 26 + 7-2 



A" 



1 - 2 ve 



T3 



ATg- 1 



1-7? 



Looking closely at U, one observes that it is the sum of the quantity 



3(Jf-i) 



and an expression 



that we can make arbitrarily small by choosing the positive constants $\xi_1$, $\xi_2$, $\xi_3$, $\tau_1$, $\tau_2$, and $\tau_3$ small enough. Consequently, there exists a suitable choice of these constants only depending on $K$ and $\eta$ that constrains the quantity $U$ to be non positive. It follows that for any $x > 0$, with probability larger than $1 - e^{-x}L(K,\eta,\alpha,\beta)$,



n n 
Integrating this upper bound for any $x > 0$, we conclude

FfR i i < L(K,V,a,0) 



□ 

Proof of Lemma 7.9. We perform a very crude upper bound by controlling the sum of the risk of every estimator $\hat\theta_m$,



1(9, #)1qjuq§ 



< 



•(fiS) + p(n§) / ]T E[i(0 m ,< 

V inGAI 



As for any model m £ M, l(9 m , 9) = l(9 m , t 



i, 9 m ), it follows that 



E 



l@ m , Of] < 2 [l(6 m ,9f + E [l(9 m , 9 m f\ } 



For any model $m \in \mathcal{M}$, it holds that $n - d_m - 3 \ge (1-\eta)n - 3$, which is positive by assumption $(\mathbb{H}_\eta)$. Hence, we may apply Proposition 7.8 with $r = 2$ to all models $m \in \mathcal{M}$:

$$\mathbb{E}\left[l(\hat\theta_m,\theta_m)^2\right] \le L\left[d_m\,n\left(\sigma^2 + l(\theta_m,\theta)\right)\right]^2 \le L\,n^4\,\mathrm{Var}(Y)^2,$$

since for any model $m$, $\sigma^2 + l(\theta_m,\theta) \le \mathrm{Var}(Y)$. By summing this bound for all models $m \in \mathcal{M}$ and applying Lemmas 7.6 and 7.7, we get



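$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\right] \le n^2\,\mathrm{Card}(\mathcal{M})\,L(K,\eta)\,\mathrm{Var}(Y)\exp[-nL'(K,\eta)],$$

where $L'(K,\eta)$ is positive.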



□ 



7.3 Proof of Theorem 3.4 and Proposition 3.5

Proof of Theorem 3.4. This proof follows the same approach as the one of Theorem 3.1. We shall only emphasize the differences with this previous proof. The bound (29) still holds. Let us respectively define the three constants $\kappa_1$, $\kappa_2$ and $\nu(K)$ as



(k-i) [i - Vv] 2 [i - Vv - v(K)]\ 

- \-^-v(K) ' * 16 A1 

( 3 V /6 

:= A . 

V ; \K + 2J 2 
We also introduce the random variables $A_{m'}$ and $B_{m'}$ for any model $m' \in \mathcal{M}$.



Am' 



B, t 



«i +1 



IIP- e - 



m'^m' lira 



l{9 m ',6) 



K2rupm ax [(Z*„,Z r . 



.„ ||n m ,(e + e m ,)|| 2 



A" 



2 ||n„ 



e m')lln 



n — d m > l(9 m >,9)+a 2 



n m' e L 



1 a 2 /( 



K 2 mp n 



ln m /(e + e m >)|| 2 
l(9 m .,9) + a 2 



K- 



dm' 



n — d r , 



2 \\Ui,(e + e ml )\\l 



u 



1 + 

The bound given in Lemma 7.5 clearly extends to

$$-\bar\gamma_n(\hat\theta) - \gamma_n(\hat\theta)\,pen(\hat m) - \sigma^2 + \|\epsilon\|_n^2 \le l(\hat\theta,\theta)\left[A_{\hat m} \vee (1-\kappa_2)\right] + \sigma^2B_{\hat m}.$$

As previously, we control the variable $A_{\hat m}$ on an event of large probability $\Omega_1$ and take the expectation of $B_{\hat m}$ on an event of large probability $\Omega_1 \cap \Omega_2$.

Lemma 7.10. Let VL\ be the event 



< (X-l)(l-^-^))' 



where $s(K,\eta)$ is a function smaller than one. Then, $\mathbb{P}(\Omega_1^c) \le L(K)\,n\exp[-nL'(K,\eta)]$ with $L'(K,\eta) > 0$.

The function $s(K,\eta)$ is given explicitly in the proof of Lemma 7.10.

Lemma 7.11. Let us assume that $n$ is larger than some quantity $n_0(K)$. Then, there exists an event $\Omega_2$ of probability larger than $1 - \exp[-nL(K,\eta)]$ where $L(K,\eta) > 0$ such that

$$\mathbb{E}\left[B_{\hat m}\mathbf{1}_{\Omega_1\cap\Omega_2}\right] \le \frac{L(K,\eta)}{n}.$$



Gathering inequalities (29), (32), Lemma 7.10 and Lemma 7.11, we obtain as in the previous proof that

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1\cap\Omega_2}\right] \le L(K,\eta)\inf_{m\in\mathcal{M}}\left[l(\theta_m,\theta) + \left(\sigma^2 + l(\theta_m,\theta)\right)pen(m)\right] + L'(K,\eta)\frac{\sigma^2}{n} + \left(\sigma^2 + l(0_p,\theta)\right)n\exp[-nL''(K,\eta)]. \qquad (42)$$

Afterwards, we control the loss of the estimator $\hat\theta$ on the event of small probability $\Omega_1^c \cup \Omega_2^c$.

Lemma 7.12. If $n$ is larger than some quantity $n_0(K)$,

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\right] \le n^{5/2}\left(\sigma^2 + l(0_p,\theta)\right)L(K,\eta)\exp[-nL'(K,\eta)],$$

where $L'(K,\eta)$ is positive.

Gathering this last bound with (42) enables us to conclude.



□ 



Proof of Lemma 7.10. This proof is analogous to the proof of Lemma 7.6, except that we shall change the weights in the concentration inequalities in order to take into account the complexity of the collection of models. Let $x$ be a positive number we shall fix later. Applying Lemma 7.2, Lemma 7.3 and Lemma 7.4 ensures that there exists an event $\Omega_1'$ such that

$$\mathbb{P}(\Omega_1'^c) \le 4\exp(-nx)\sum_{m\in\mathcal{M}}\exp[-d_mH(d_m)],$$

and for all models $m \in \mathcal{M}$,



in-t-e II 2 



II m (e + e r 

<7 2 +K0 m 



|n^(e + e 



i" l\\n 



a 2 + l( 



nip n 



(z^z m )- 



> 



< 



> 



< 




2xn 
n — gL 



VO 



1- (1 + V2#(dm)j \l — V2x ] V 



2xn 
n — d n 

-I -2 



VO 



(43) 
(44) 
(45) 



We recall that $\delta_d$ is defined in (28). Besides, it holds that

$$\mathbb{P}(\Omega_1'^c) \le 4\exp[-nx]\sum_{d=0}^{n}\mathrm{Card}\left[\{m \in \mathcal{M},\ d_m = d\}\right]\exp[-dH(d)] \le 4n\exp[-nx].$$



By Assumption (H^ i?? ), the expression ^1 + y / 2H(d m )j y ^ is bounded by y/fj. Hence, condi- 
tionally on Q'i, 

-2 

; rf a - J I t — * \ 

Constraining x to be smaller than 



< 



nK 2 <p„ 



l-s/rj- V2a?J V 
- — g — ensures that 

(^-i)(i-V^-K^)) 2 



( z a z ™) 



In' < 



By assumption $(\mathbb{H}_{K,\eta})$, the dimension of any model $m \in \mathcal{M}$ is smaller than $n/2$. If $n$ is larger than some quantity only depending on $K$, then $\delta_{n/2}$ is smaller than $\nu(K)$. Let us assume first that this is the case. We recall that $\nu(K)$ is defined at the beginning of the proof of Theorem 3.4. Since $\nu(K) < 1 - \sqrt{\eta}$, inequality (43) becomes
Bounding analogously the remaining terms of $A_m$, we get



< /si + 1 - [1- yfn- S n/2 ] 2 + -^(1 - y/rj- S^fUt + yGU 2 + xU 3 



where U\, U 2l and U 3 are respectively defined as 

2 



-K 
2V% 



1 + v/2iJ(d ffi )j + 1 + (K - l)/2 [1 + 
1 + Krj] 



< 



Since $U_1$ is non-positive, we obtain an upper bound of $A_m$ that does not depend anymore on $m$. By assumption $(\mathbb{H}_{K,\eta})$, we know that $\eta < \left(1 - \nu(K) - \left(\frac{3}{K+2}\right)^{1/6}\right)^2$. Hence, coming back to the definition of $\kappa_1$ allows us to prove that $\kappa_1$ is strictly smaller than $[1 - \sqrt{\eta} - \nu(K)]^2$. Setting



[l-^-u{K)} 2 



-1 2 



we get 



on the event Oj . 



4C7 2 



Afh < 1 — - 



< 1 



In order to take into account the case $\delta_{n/2} > \nu(K)$, we only have to choose a large constant $L(K)$ in the upper bound of $\mathbb{P}(\Omega_1^c)$. □



Proof of Lemma 7.11. Once again, the sketch of the proof closely follows the proof of Lemma 7.7. Let us consider the random variables $E_m$ defined as



-1 ^m' e "i')n 

<J 2 HO m >,0) 



in. 



Since $n\|\epsilon\|_n^2/\sigma^2$ follows a $\chi^2$ distribution with $n$ degrees of freedom, there exists an event $\Omega_2$ of probability larger than $1 - \exp[-nL(K)]$ such that $\|\epsilon\|_n^2/\sigma^2$ is smaller than $\kappa_1^{-1} = \sqrt{(K+2)/3}\,[1 - \sqrt{\eta} - \nu(K)]$ on $\Omega_2$. The constant $L(K)$ in the exponential is positive. We shall simultaneously



upper bound the deviations of the random variables E m , ^g^^+JJ 



and ii^f^'-F" . Let £ be 



2 +l{9 m ,8) 

some positive constant that we shall fix later. For any x > 0, we define an event F(x) such that 
conditionally on F(x) fl ^2, 



E m 

||n m (e+e m )||; 

l|n^ m +e||^ 

a 2 +l(8 m ,9) 



< 
+ 

< I 
— n 

> 



o. -2 £(d m +H(d m ))+x 
1 n 



d m + 2\Jd m [d m (je + H{d m )) +x]+2 + H{d m )) + x] 

n 2 

n—d. 



1 " $n- 



d m (l+2H(d m )) 

71 — dm 



n—d n 



vo 




for any model $m \in \mathcal{M}$. Then, the probability of $F(x)$ satisfies



F pF(ar) 



< e" 



e" 



^ exp [-d m fr(d rn )] (e « d -+e ^ + e ^) 



m£.M 
1 



+ 



+ 



l_e-i l- e -i/i« l-e-V2 y 
Let us expand the three deviation bounds thanks to the inequality $2ab \le \tau a^2 + \tau^{-1}b^2$:



Em < 



l + 2y/l + 2k- 2 £ + + t 2 | + - [2 K - 2 + t- 1 + n] 



+ ^-[i + -r 1 «r 2 ] + 



i -2i d m H(d m ) 



n 



[2ki 2 + ti] + 2 



d m sj H{d m ) 



< 



^ f 1 + ^/2H(d^j) 2 \k^ 2 + 2^1+ 2/^ 2 £ + n£ + r 2 
n V / L 



n 

+ ^[2 Kr 2 + r 2 - 1 +r 1 ]+^[l + rr 1 «r 2 ] . 



Similarly, we get 



n m (e + e m )\ 
l(6 m ,6)+a 2 



< 2- 



n 



1 + V2H{d m ) 



If $n$ is larger than some quantity $n_0(K)$, then $\delta_{n/2}$ is smaller than $\nu(K)$. Applying Assumption $(\mathbb{H}_{K,\eta})$, we get



n — d m \ / £(0™ 



< 



Z(0 m ,0)+a 2 

d r „ , \ 2 

n 



(i + v^(^)) ! 



l-V^-i/(A-) 



2x 



n — d„ 



VO 



< -K^i (l + ^2H(d m )Y [(l-^-^)) 2 - T3 l +2KTJTZ 1 - . 
n \ / L J n 

Let us combine these three bounds with the definitions of $B_m$, $\kappa_1$, and $\kappa_2$. Hence, conditionally on the event $\Omega_1 \cap \Omega_2 \cap F(x)$,



1 + y/2H(m) 



f/i + -(7 2 + 

n 



c/ 3 



(46) 



where 



c/ 3 



= - *=i (1 - ^ - iz(A-))' + tfr 3 + 2V? + 2«r 2 e + n£ + r 2 
= 1 + rf 1 . 



Since $K > 1$, there exists a suitable choice of the constants $\xi$, $\tau_1$, and $\tau_2$, only depending on $K$ and $\eta$, that constrains $U_1$ to be non positive. Hence, conditionally on the event $\Omega_1 \cap \Omega_2 \cap F(x)$,

n n 
Since $\mathbb{P}[F(x)^c] \le e^{-x}L(K,\eta)$, we conclude by integrating the last expression with respect to $x$.

□ 

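Proof of Lemma 7.12. As in the ordered selection case, we apply Cauchy-Schwarz inequality

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\right] \le \sqrt{\mathbb{P}(\Omega_1^c)+\mathbb{P}(\Omega_2^c)}\,\sqrt{\mathbb{E}\left[l(\hat\theta,\theta)^2\right]}.$$

However, there are too many models to bound efficiently the risk of $\hat\theta$ by the sum of the risks of the estimators $\hat\theta_m$. This is why we use here Hölder's inequality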



E 



J(M)ln f un s 



< L(K)Vnexp[-nL(K,7])} 




< L(K)y/nexp[-nL(K,7])} J P ( m = ™) 1/Me K^m,0) 2v ,(47) 

V mGM 

where v := and u =: -^zj. We assume here that n is larger than 8. For any model m € Ai, 

the loss l(6 m , 9) decomposes into the sum l(9 m , 9) + l(9 m , 9 m ). Hence, we obtain the following upper 
bound by applying Minkowski's inequality 



E 



\2v 



1/2V < l{0 m , 9) +E \l(9 m , 9 m f v ] 1/2V < Var(F) + E 



2l: 



l/2v 



(48) 



We shall upper bound this last term thanks to Proposition 7.8. Since $v$ is smaller than $n/8$ and since $d_m$ is smaller than $n/2$, it follows that for any model $m \in \mathcal{M}$, $n - d_m - 4v + 1$ is positive and



E 



l(9 m ,9)) , 

for any model to 6 M. S ince dm 71 and since cr 2 -t- l{9rni 9~) Vf-ir(Y') . we obtain 

-1 i/2u 



E 



< 2wWVar(F) . 



Gathering upper bounds 03, (0HJ, and {49]) we get 



E 



1(6,0)1 



n?uns 



< L(K)y/nexp [—nL'(K, rj)} 



[Var(F) + 2uLn 2 Var(Y~)] / ^ P ( TO = TO) 



l/» 



(49) 



Since the sum over $m \in \mathcal{M}$ of $\mathbb{P}(\hat m = m)$ is one, the last term of the previous expression is maximized when every $\mathbb{P}(\hat m = m)$ equals $\frac{1}{\mathrm{Card}(\mathcal{M})}$. Hence,

$$\mathbb{E}\left[l(\hat\theta,\theta)\mathbf{1}_{\Omega_1^c\cup\Omega_2^c}\right] \le n^{5/2}\,\mathrm{Var}(Y)\,L(K,\eta)\,\mathrm{Card}(\mathcal{M})^{1/(2v)}\exp[-nL'(K,\eta)],$$
where $L'(K,\eta)$ is positive. Let us first bound the cardinality of the collection $\mathcal{M}$. We recall that the dimension of any model $m \in \mathcal{M}$ is assumed to be smaller than $n/2$ by $(\mathbb{H}_{K,\eta})$. Besides, for any $d \in \{1,\ldots,n/2\}$, there are less than $\exp(dH(d))$ models of dimension $d$. Hence,

$$\log\left(\mathrm{Card}(\mathcal{M})\right) \le \log(n) + \sup_{d=1,\ldots,n/2} dH(d).$$

By assumption $(\mathbb{H}_{K,\eta})$, $dH(d)$ is smaller than $n/2$. Thus, $\log(\mathrm{Card}(\mathcal{M})) \le \log(n) + n/2$ and it follows that $\mathrm{Card}(\mathcal{M})^{1/(2v)}$ is smaller than a universal constant provided that $n$ is larger than 8.
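The maximization step used above can be illustrated numerically. The sketch below assumes conjugate exponents $1/u + 1/v = 1$, so that $u = v/(v-1)$, and checks that for probability weights $p_1,\ldots,p_N$ the sum $\sum_i p_i^{1/u}$ is largest for the uniform weights, where it equals $N^{1/v}$. The function names are hypothetical.

```python
import random

def holder_sum(probs, v):
    """Return sum_i p_i^(1/u) with u = v / (v - 1), the conjugate exponent of v."""
    u = v / (v - 1)
    return sum(p ** (1.0 / u) for p in probs)

def random_distribution(size, rng):
    """Draw a random probability vector of the given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    rng = random.Random(1)
    size, v = 5, 3
    uniform_value = holder_sum([1.0 / size] * size, v)
    print("uniform:", uniform_value, "N^(1/v):", size ** (1.0 / v))
    for _ in range(3):
        print("random :", holder_sum(random_distribution(size, rng), v))
```

By concavity of $t \mapsto t^{1/u}$, no random draw exceeds the uniform value, which is the source of the factor $\mathrm{Card}(\mathcal{M})^{1/(2v)}$ after taking the square root.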
All in all, we get 



E 



L f2f UOS 



< n 5/2 Var{Y)L(K, n) exp [-nL'(K, n)} 



where L'(K,n) is positive. □ 

Proof of Proposition 3.5. We apply the same arguments as in the proof of Theorem 3.4, except that we replace $H(d_m)$ by $l_m$.



A, 



K 



1 + \/2l 



in 1 - e -II 2 

11 m /t: " _\\n 
2 d 



K 2 n<Anax [(Z m 'Z m ') 
lln^e + e™,)!! 2 



|n ro ,(e- 



-m')lln 



Z(0 m ,,0)+a 2 



n — d m ' H 



± .e A 2 



m' ; 1 
2 



|n»n' e ||rj 



K 2 mp n 



jn m /(e - 



l(9m>,d)+<J 2 



K- 



dm' 



n - d m i 



1 



'21' 



2 ||n^,(6 + e m Q||: 

l(9 m ,,6)+<j 2 



In fact, Lemmas 7.10, 7.11 and 7.12 are still valid for this penalty. The previous proofs of these three lemmas depend on the quantity $H(d_m)$ through the properties:

$H(d_m)$ satisfies assumption $(\mathbb{H}_{K,\eta})$ and $\sum_{m\in\mathcal{M},\ d_m=d}\exp(-dH(d_m)) \le 1$.

Under the assumptions of Proposition 3.5, $l_m$ satisfies the corresponding Assumption $(\mathbb{H}^l_{K,\eta})$ and is such that $\sum_{m\in\mathcal{M},\ d_m=d}\exp(-dl_m) \le 1$. Hence, the proofs of these lemmas remain valid in this setting if we replace $H(d_m)$ by $l_m$.

There is only one small difference at the end of the proof of Lemma 7.12 when bounding $\log(\mathrm{Card}(\mathcal{M}))$. By definition of $l_m$,

$$\mathrm{Card}(\mathcal{M}) - 1 \le \sup_{m\in\mathcal{M}\setminus\{\emptyset\}}\exp(d_ml_m).$$

Hence, $\log(\mathrm{Card}(\mathcal{M})) \le 1 + \sup_{m\in\mathcal{M}\setminus\{\emptyset\}} d_ml_m$, which is smaller than $1 + n/2$ by Assumption $(\mathbb{H}^l_{K,\eta})$. Hence, the upper bound shown in the proof of Lemma 7.12 is still valid.

□ 
7.4 Proof of Proposition 7.8

Proof of Proposition 7.8. Let $m$ be a subset of $\{1,\ldots,p\}$. Thanks to (27), we know that

$$l(\hat\theta_m,\theta_m) = (\epsilon+\epsilon_m)^*Z_m(Z^*_mZ_m)^{-2}Z^*_m(\epsilon+\epsilon_m).$$

Applying Cauchy-Schwarz inequality, we decompose the $r$-th loss of $\hat\theta_m$ in two terms



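$$\mathbb{E}\left[l(\hat\theta_m,\theta_m)^r\right] \le \mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^r\,\|Z_m(Z^*_mZ_m)^{-2}Z^*_m\|_F^r\right] \le \mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^r\right]\mathbb{E}\left[\mathrm{tr}\left((Z^*_mZ_m)^{-2}\right)^{r/2}\right], \qquad (50)$$

by independence of $\epsilon$, $\epsilon_m$, and $Z_m$. Here, $\|\cdot\|_F$ stands for the Frobenius norm in the space of square matrices. We shall successively upper bound the two terms involved in (50).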



$$\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^r = \left[\sum_{1\le i,j\le n}(\epsilon+\epsilon_m)[i]^2\,(\epsilon+\epsilon_m)[j]^2\right]^{r/2}.$$

This last expression corresponds to the $L^{r/2}$ norm of a Gaussian chaos of order 4. By Theorem 3.2.10 in [18], such chaos satisfy a Khintchine-Kahane type inequality:

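Lemma 7.13. For all $d \in \mathbb{N}$ there exists a constant $L_d \in (0,\infty)$ such that, if $X$ is a Gaussian chaos of order $d$ with values in any normed space $F$ with norm $\|\cdot\|$ and if $1 < s < q < \infty$, then

$$\left(\mathbb{E}\|X\|^q\right)^{1/q} \le L_d\left(\frac{q-1}{s-1}\right)^{d/2}\left(\mathbb{E}\|X\|^s\right)^{1/s}.$$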



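Let us assume that $r$ is larger than four. Applying the last lemma with $d = 4$, $q = r/2$, and $s = 2$ yields

$$\mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^r\right]^{2/r} \le L_4\,(r/2-1)^2\,\mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^4\right]^{1/2}.$$

By standard Gaussian properties, we compute the fourth moment of this chaos and obtain

$$\mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^4\right]^{1/2} \le L\,n^2\left[\sigma^2 + l(\theta_m,\theta)\right]^2.$$

Hence, we get the upper bound

$$\mathbb{E}\left[\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F^r\right]^{1/r} \le L\,(r-1)\,n\left[\sigma^2 + l(\theta_m,\theta)\right]. \qquad (51)$$

Straightforward computations allow us to extend this bound to $r = 2$ and $r = 3$.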
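The moment computation behind this step can be made explicit. Since $\|aa^*\|_F = \|a\|^2$ for any vector $a$, the quantity $\|(\epsilon+\epsilon_m)(\epsilon+\epsilon_m)^*\|_F$ is distributed as $[\sigma^2 + l(\theta_m,\theta)]$ times a $\chi^2(n)$ variable, so its fourth moment reduces to the fourth raw moment of a $\chi^2(n)$ distribution, $n(n+2)(n+4)(n+6)$. The sketch below (the function name is hypothetical) checks that this moment is at most $105\,n^4$, the value 105 being one admissible choice for the unspecified constant.

```python
def chi2_fourth_moment(n):
    """Fourth raw moment of a chi-square variable with n degrees of freedom."""
    return n * (n + 2) * (n + 4) * (n + 6)

if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        ratio = chi2_fourth_moment(n) / n ** 4
        print(n, chi2_fourth_moment(n), ratio)  # the ratio decreases towards 1
```

Taking square roots gives a bound of order $n^2[\sigma^2 + l(\theta_m,\theta)]^2$, as used in the display above.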



Let us turn to bounding the second term of (50). Since the eigenvalues of the matrix $(Z^*_mZ_m)^{-1}$ are almost surely non-negative, it follows that

$$\mathrm{tr}\left[(Z^*_mZ_m)^{-2}\right] \le \mathrm{tr}\left[(Z^*_mZ_m)^{-1}\right]^2.$$
Consequently, we shall upper bound the $r$-th moment of the trace of an inverse standard Wishart matrix. For any couple of matrices $A$ and $B$ respectively of size $p_1 \times q_1$ and $p_2 \times q_2$, we define the Kronecker product matrix $A \otimes B$ as the matrix of size $p_1p_2 \times q_1q_2$ that satisfies:

$$A \otimes B\,[i_2 + p_2(i_1 - 1);\ j_2 + q_2(j_1 - 1)] := A[i_1; j_1]\,B[i_2; j_2], \quad\text{for any}\quad 1 \le i_1 \le p_1,\ 1 \le i_2 \le p_2,\ 1 \le j_1 \le q_1,\ 1 \le j_2 \le q_2.$$
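The indexing convention of the Kronecker product can be written out directly. The following is a minimal sketch on plain nested lists with 0-based indices (so the entry above becomes `out[i2 + p2 * i1][j2 + q2 * j1]`); the helper names are hypothetical. It also checks the identity $\mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B)$ for square matrices, from which $\mathrm{tr}(A)^k = \mathrm{tr}(\otimes^k A)$ follows.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    p1, q1 = len(a), len(a[0])
    p2, q2 = len(b), len(b[0])
    out = [[0] * (q1 * q2) for _ in range(p1 * p2)]
    for i1 in range(p1):
        for j1 in range(q1):
            for i2 in range(p2):
                for j2 in range(q2):
                    out[i2 + p2 * i1][j2 + q2 * j1] = a[i1][j1] * b[i2][j2]
    return out

def trace(m):
    """Trace of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

def kron_power(a, k):
    """k-th Kronecker power of a, for an integer k >= 1."""
    out = a
    for _ in range(k - 1):
        out = kron(out, a)
    return out

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    print(trace(a) ** 3, trace(kron_power(a, 3)))  # both equal 125
```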



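For any matrix $A$, $\otimes^kA$ refers to the $k$-th power of $A$ with respect to the Kronecker product. Since $\mathrm{tr}(A)^k = \mathrm{tr}(\otimes^kA)$ for any square matrix $A$, we obtain

$$\mathbb{E}\left[\mathrm{tr}\left((Z^*_mZ_m)^{-1}\right)^k\right] = \mathrm{tr}\left[\mathbb{E}\left(\otimes^k(Z^*_mZ_m)^{-1}\right)\right] \le d_m^{k/2}\left\|\mathbb{E}\left[\otimes^k(Z^*_mZ_m)^{-1}\right]\right\|_F$$

thanks to Cauchy-Schwarz inequality. In Equation (4.2) of [37], Von Rosen has characterized recursively the expectation of $\otimes^k(Z^*_mZ_m)^{-1}$ as long as $n - d_m - 2k - 1$ is positive:

$$\mathrm{vec}\left(\mathbb{E}\left[\otimes^{k+1}(Z^*_mZ_m)^{-1}\right]\right) = A(n,d_m,k)^{-1}\,\mathrm{vec}\left(\mathbb{E}\left[\otimes^k(Z^*_mZ_m)^{-1}\right]\otimes I\right), \qquad (52)$$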



where 'vec' refers to the vectorized version of the matrix. See Section 2 of [37] for more details 
about this definition. A(n, d m ,k) is a symmetric matrix of size d^ 1 x d^ 1 which only depends 
on $n$, $d_m$, and $k$ and is known to be diagonally dominant. More precisely, any diagonal element of $A(n,d_m,k)$ is greater than or equal to one plus the corresponding row sum of the absolute values of the off-diagonal elements. Hence, the matrix $A$ is invertible and its smallest eigenvalue is larger than or equal to one. Consequently, $\varphi_{\max}(A^{-1})$ is smaller than or equal to one. It then follows from (52) that

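$$\left\|\mathbb{E}\left[\otimes^{k+1}(Z^*_mZ_m)^{-1}\right]\right\|_F = \left\|\mathrm{vec}\left(\mathbb{E}\left[\otimes^{k+1}(Z^*_mZ_m)^{-1}\right]\right)\right\|_F \le \varphi_{\max}(A^{-1})\left\|\mathrm{vec}\left(\mathbb{E}\left[\otimes^k(Z^*_mZ_m)^{-1}\right]\otimes I\right)\right\|_F \le \sqrt{d_m}\left\|\mathbb{E}\left[\otimes^k(Z^*_mZ_m)^{-1}\right]\right\|_F.$$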



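By induction, we obtain

$$\mathbb{E}\left[\mathrm{tr}\left((Z^*_mZ_m)^{-1}\right)^r\right] \le d_m^r, \qquad (53)$$

if $n - d_m - 2r + 1 > 0$. Combining upper bounds (51) and (53) enables us to conclude

$$\mathbb{E}\left[l(\hat\theta_m,\theta_m)^r\right]^{1/r} \le L\,r\,d_m\,n\left(\sigma^2 + l(\theta_m,\theta)\right).$$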



□ 



7.5 Proof of Proposition 3.2

Proof of Proposition 3.2. Let $m_*$ be the model that minimizes the loss function $l(\hat\theta_m,\theta)$:

$$m_* := \arg\inf_{m\in\mathcal{M}_{\lfloor n/2\rfloor}} l(\hat\theta_m,\theta).$$





It is almost surely uniquely defined. Contrary to the oracle $m^*$, the model $m_*$ is random. By definition of $\hat m$, we derive that

$$l(\hat\theta,\theta) \le l(\hat\theta_{m_*},\theta) + \gamma_n(\hat\theta_{m_*})\,pen(m_*) + \bar\gamma_n(\hat\theta_{m_*}) - \gamma_n(\hat\theta)\,pen(\hat m) - \bar\gamma_n(\hat\theta), \qquad (54)$$

where $\bar\gamma_n$ is defined in the proof of Theorem 3.1. The proof divides in two parts. First, we state that on an event $\Omega_1$ of large probability, the dimensions of $\hat m$ and of $m_*$ are moderate. Afterwards, we prove that on another event of large probability $\Omega_1 \cap \Omega_2 \cap \Omega_3$, the ratio $l(\hat\theta,\theta)/l(\hat\theta_{m_*},\theta)$ is close to one.

Lemma 7.14. Let us define the event $\Omega_1$ as:

$$\Omega_1 := \left\{\log^2(n) < d_{m_*} < \frac{n}{\log n} \ \text{ and } \ \log^2(n) < d_{\hat m} < \frac{n}{\log n}\right\}.$$

The event Sli is achieved with large probability: P(f2i) > 1 — ■ 
Lemma 7.15. There exists an event SI2 of probability larger than 1 — L- 2 ^ such that 

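$$\left[-\bar\gamma_n(\hat\theta) - \gamma_n(\hat\theta)\,pen(\hat m) - \sigma^2 + \|\epsilon\|_n^2\right]\mathbf{1}_{\Omega_1\cap\Omega_2} \le l(\hat\theta,\theta)\,\tau_1(n),$$

where $\tau_1(n)$ is a positive sequence converging to zero when $n$ goes to infinity.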
Lemma 7.16. There exists an event of probability larger than 1 — L- 2 ^ such that 

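$$\left[\bar\gamma_n(\hat\theta_{m_*}) + \gamma_n(\hat\theta_{m_*})\,pen(m_*) + \sigma^2 - \|\epsilon\|_n^2\right]\mathbf{1}_{\Omega_1\cap\Omega_3} \le l(\hat\theta_{m_*},\theta)\,\tau_2(n),$$

where $\tau_2(n)$ is a positive sequence converging to zero when $n$ goes to infinity.

Gathering these three lemmas, we derive from the upper bound (54) the inequality

$$\frac{l(\hat\theta,\theta)}{l(\hat\theta_{m_*},\theta)}\,\mathbf{1}_{\Omega_1\cap\Omega_2\cap\Omega_3} \le \frac{1+\tau_2(n)}{1-\tau_1(n)},$$
which allows us to conclude.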

□ 

Proof of Lemma \7.14\ Let us consider the model m^ s defined by d wlR3 := [(nR 2 )~ \. If n is 
larger than some quantity L(R,s), then d mR s is smaller than n/2 and to/j iS therefore belongs to 
the collection M y n /2\ ■ We shall prove that outside an event of small probability, the loss l(6 mRs , 6) 
is smaller than the loss 1(6 m , 6) of all models to e M.y n /2\ whose dimension is smaller than log 2 (n) 
or larger than j^^- Hence, the model to* satisfies log 2 (n) < d m , < with large probability. 

First, we need to upper bound the loss l(6 mR s ,6). Since l(0 mRg , 6) — l(0 mRg , 9)+l(6 mR s , 6 mR s ), 
it comes to upper bounding both the bias term and the variance term. Since 6 belongs to £' S (R), 

+00 



i(^ras, s ^) — ^ l(6m i ^ 1 ,6 mi ) 



i>d mR , s 



i>d„ 
Then, we bound the variance term $l(\hat\theta_{m_{R,s}},\theta_{m_{R,s}})$ thanks to (36) as in the proof of Lemma 7.5



< [cr 2 + i(e mR , s ,e)]<p m ^ n{z* mR z mR>s )- 1 



\^rn RyS (t + e mH, s )||„ 

a 2 + l(9 mRs ,9) 



The two random variables involved in this last expression respectively follow (up to a factor $n$) the distribution of an inverse Wishart matrix with parameters $(n, d_{m_{R,s}})$ and a $\chi^2$ distribution with $d_{m_{R,s}}$ degrees of freedom. Thanks to Lemmas 7.2 and 7.4, we prove that outside an event of probability smaller than $L(R,s)\exp[-L'(R,s)\,n^{1/(1+s)}]$ with $L'(R,s) > 0$,



< 4[<j 2 + i(e„ lRs ,e)] 



if n is large enough. Gathering this last upper bound with lj55|) yields 

2" 



.R— 



R— 



< o 



3 C(R,a) 



n l + s 



(56) 



where $C(R,s)$ is a constant that only depends on $R$ and $s$.

Let us prove that the bias term of any model of dimension smaller than $\log^2(n)$ is larger than (56) if $n$ is large enough. Obviously, we only have to consider the model of dimension $\lfloor\log^2(n)\rfloor$. Assume that there exists an infinite increasing sequence of integers $u_n$ satisfying:



E *(< 

i>log 2 (u„) 



< 



C(R, s) 



(57) 



Then, the sequence (v n ) defined by v n := log 2 (w n ) satisfies 



E'OWi'^) <C{R,s)exp 



-y/Vn+l 



1 + S 



Let us consider a subsequence of (v n ) such that [v n \ is strictly increasing. For the sake of simplicity 
we still call it v n . It follows that 



E 

4=K)J+1 



I 



+ 00 L^n + lJ 

E E 

n=0 i=[v n \-\-l 



rrii—i i ^rrii 



< C(i?, S )^K+iJ s 'exp 



VK+iJ- 



< oo , 



and $\theta$ therefore belongs to some ellipsoid $\mathcal{E}_{s'}(R')$. This contradicts the assumption that $\theta$ does not belong to any ellipsoid $\mathcal{E}_{s'}(R')$. As a consequence, there only exist finitely many integers $u_n$ that satisfy Condition (57). For $n$ large enough, the bias term of any model of dimension less than $\log^2(n)$ is therefore larger than the loss $l(\hat\theta_{m_{R,s}},\theta)$ with overwhelming probability.
Let us turn to the models of dimension larger than $n/\log n$. We shall prove that with large probability, for any model $m$ of dimension larger than $n/\log n$, the variance term $l(\hat\theta_m,\theta_m)$ is larger than the order $\sigma^2/\log n$. For any model $m \in \mathcal{M}_{\lfloor n/2\rfloor}$,



\0m 5 9rri) 



L[n/2J. 
T 2 



> 



|n m (e 



(Z* Tl Z. r 



(J 2 +l( 



The two random variables involved in this expression respectively follow (up to a factor $n$) a Wishart distribution with parameters $(n,d_m)$ and a $\chi^2$ distribution with $d_m$ degrees of freedom. Again, we apply Lemmas 7.2 and 7.4 to control the deviations of these random variables. Hence, outside an event of probability smaller than $L(\xi)\exp[-n\xi/\log n]$,



u 



Um + u„ x 



for any model m of dimension larger than n/logn. For any model m S A^l„/2J > the ratio d m /n is 
smaller than 1/2. As a consequence, we get 



0m) > ^ (l - 2v^) (l + y/T/2 + 



Choosing for instance £ = 1/16 ensures that for n large enough the loss l{0 m ,9 m ) is larger than 
l(0rn Rs ,8) for every model m of dimension larger than n/logn outside an event of probability 
smaller than L\ exp[— L^nj logn] + L^(R, s) exp[— L^(R, s)n 1 ^ 1+5 ^] with L^(R, s) > 0. 



Let us now turn to the selected model m. We shall prove that outside an event of small 
probability, 



In {8m Ri }j [1 +pen{m R . s )] < j n (j) m ) [1 + pen(m)} , 



(58) 



for all models $m$ of dimension smaller than $\log^2 n$ or larger than $n/\log n$. We first consider the models of dimension smaller than $\log^2(n)$. For any model $m \in \mathcal{M}_{\lfloor n/2\rfloor}$, $\gamma_n(\hat\theta_m)\,n/[\sigma^2 + l(\theta_m,\theta)]$ follows a $\chi^2$ distribution with $n - d_m$ degrees of freedom. Again, we apply Lemma 7.2. Hence, with probability larger than $1 - e/[n^2(e-1)]$, the following bound holds for any model $m$ of dimension smaller than $\log^2(n)$.



7n ( 0m) [1 + pen(m)] > a 2 



n — riL 



- d m ) (d m + 2 log(n)) 



' d m + 2 log(n) 



1-4 



logn 
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for n large enough. Besides, outside an event of probability smaller than -jy, 



In \ 9m R , a J [1 +pen(m_R, s )] 



< G 



1 + 2- 



n — cL 



2 V /(n-d mRi3 )21ogn | ^ logn 



](- 









1 + 2- 



\/21ogn | 4 logn 



For ro large enough, g? TOr s is smaller than and the last upper bound becomes: 



In (SmR^J [1 +pen(m R ^ s )] < a 

Hence, 7„ (0 



1 



C(i?, s) 



1 + 10 



log(n) 



1 +pen(m RtS )} < j n (6 m ) [1 + pen(m)} if 
/((9 



> 3 C (^; S ) x l + 101og(")/VH + H log(n) 
~ n iTi ~ 1 - 41og(n)/ % /n >/n 



As previously, this inequality always holds except for a finite number of n, since 9 does not belong to 
any ellipsoid £ s i (R'). Thus, outside an event of probability smaller than -4-, rfm is larger than log 2 n. 



Let us now turn to the models of large dimension. Inequality (58) holds if the quantity 



2d n 



4||n m e||Ui+ 



n — d v 



2d„ 



n — d„ 



is non-positive. The three following bounds hold outside an event of probability smaller than 



HO . 



lell? > 1-4*5* 



|n m e|| 2 < (1 + £) — , for all models m of dimension d m > 

n 



logn 

^/(n - d mj?i Jlogn 4 logn 



yj{n- d mR s )l0g1 



Gathering these three inequalities, we upper bound (59) by 

' n + d 



2 



2 + 8,/^ + (l + 



2fX 



2 "tor,. 



+a z L 1 



4 a A f l(0 mR ..,e) | V*(fl mfl ,.,flA / 1+ / logn 
n / y a 2 ' cr y n - d mR 

The dimension of any model $m \in \mathcal{M}_{\lfloor n/2 \rfloor}$ is assumed to be smaller than $n/2$ and the dimensions of 
the models $m$ considered are larger than $n/\log n$. For $\xi$ small enough and $n$ large enough, the previous 
expression is therefore upper bounded by 



logn 



-(1 + 0-2- 



log 71 



R— 



n l + s 



R— 



(60) 



For n large enough, this last quantity is clearly non-positive. 

All in all, we have proved that for n large enough outside an event of probability smaller than 



L(R,s) 
n 1 - ' 



2 ^ 

log (n) < dm, < 



logn 



2 ^ 

and log (n) < dfh < 



logn 



□ 

Proof of Lemma 7.15. Arguing as in the proof of Theorem 3.1, we upper bound 

- 7 n (0) " ln(0)pen(m) + a 2 + \\ef n < l(6 m , 6)A €l + a 2 B ffl + (1 - K 2 (n))l(6, 9 m ) , (61) 

where and B„ are respectively defined in (|30l) and in IpTj) . We will fix the quantities k% (n) and 
K2(n) later. Besides, we define and bound the quantity E m as in (fUj) . 

Applying Lemma 7.2 and Lemma 7.4 and arguing as in the proofs of Lemma 7.6 and Lemma 
7.7, there exists an event $\Omega_2$ of large probability 



$$\mathbb{P}(\Omega_2^c) \le \exp[-n/8] + 5 \sum_{d \ge \log^2(n)} \exp\left[-\frac{2d}{\log n}\right] \le \exp[-n/8] + \frac{5 \log n}{2n^2(1 - 1/\log n)},$$

and such that conditionally on $\Omega_1 \cap \Omega_2$, 



lllie- II 2 



K 

\U m (e + e 



2 

m)\\n 



a 2 + l(6 m ,6) 

|ni(e + esi )|| 2 



a 2 + l{9* 



> 



< 



> 



< 



n - d,n _ ^ y/2(n- d^dffjlogn 



' HI 



2y/2d 



n n^/log n nlogn' 
n — df. 



- 2 



y/2(n - d m )d ffi / logn 



11 



11 



< n 



1-1 



logn 



< 2 

dfh + 2k^ 1 (ti) + 2 
n n 



d r7l + (2/Si \n)Y 



2d fh 
logn 



nlogn 

Gathering these six upper bounds, we are able to upper bound $A_{\hat{m}}$ and $B_{\hat{m}}$: 



din i d m 



A m < Ki(n) + Li\ — 1 

1/ n log n n 



-1 + L 2 a 



(n - d m )logn 



K 2 {n) 



1 + L 3 J Vlog(n) 



.! 2 



1 _ (1 + / * ) J§BL) 
\ A/ log n / V n / 



B m < — 

n 



-1 + Lu 



dfh , s 1 + L 2 /y/\og(n) 

+ K 2 {n)- 



(n - d„) log n 



k x x (n) k 1 (n) 1 



+ 



^rV) 



dfh logn V lo g( n ) V lo S( n ) d " 

Conditionally to the event fii, the dimension of fh is moderate. Setting m to jd^j, we get 



logn ?i 



-1 + + K 2 (n} 
logn 



Bih < — 
n 



-1 + : h «2(n)- — 



1 - 



L 2 



L4 



\/ lo s( r ' 



£4 



logn 



1 - 



Vlogn 



Hence, there exists a sequence «2(«) converging to one such that conditionally on fl\ (~]fl 2 , B m is 
non-positive and A m is bounded by when n is large enough. Coming back to the inequality 
dSU yields 



-7„(0) ~ ln(e)pen(m) - a 2 + \\e\\ 2 n l ni nn 2 < 1(0, 9) 



lop 



V(l-« a (n)) 



□ 



which concludes the proof. 
Proof of Lemma 7.16. We follow a similar approach to the previous proof. 

)pen(m*) + a 2 - \\e\\ n < C m ,l(O m , , 0) + D„ u a 2 + K 2 (ri)l(6 m „ , m ,), (62) 
where for any model $m' \in \mathcal{M}_{\lfloor n/2 \rfloor}$, $C_{m'}$ and $D_{m'}$ are respectively defined as 



C m ' = Ki(n) + 



-1 + 2 



dm' l|n^(e + e m /)||2 



l{9 m ;8) n-d m . l{9 m >,6)+cj 2 



(\ + K 2 (n)) 



|n m (e -f- £ m /)|| n 



(ftnax (Z^/Z«i') l(9m>id)+V 2 



n - K -i^< n m' e > n ™' e ™'>« ll n -' £ lln 



(1 + K 2 (n)) 



n ||n m ,(e + e m 0||2 u 2 _ d m , \\Il m ,(e + e m ,)\\ 2 n 



<p max (Z* m ,Z m> ) l{e rn ,,6) + a 2 n-d m , l(6 m ,,6)+a 2 
We fix $\kappa_1(n) = 1/\log n$ whereas $\kappa_2(n)$ will be fixed later. Arguing as in the proof of Lemma 7.15, 
there exists an event $\Omega_3$ of large probability 



$$\mathbb{P}(\Omega_3^c) \le \exp[-n/8] + 5 \sum_{d \ge \log^2(n)} \exp\left[-\frac{2d}{\log n}\right] \le \exp[-n/8] + \frac{5 \log n}{2n^2(1 - 1/\log(n))},$$



such that conditionally on $\Omega_1 \cap \Omega_3$, the two following bounds hold: 



Cm. 5: 



log 77 



n 



77 



1 + 



l + r -=--(l + /sa(n))- 



log n 



logn 
L 2 



1 



/log n 



log 77 \/log n 



(l + «2(n))- 



log n 



if n is large. The main difference with the proof of Lemma fr. 151 lies in the fact that we now control 
the largest eigenvalue of Z^Z m thanks to the second result of Lemma EH There exists a sequence 
K2(n) converging to such that conditionally on f^i n O3, D m » is non-positive and C m , is bounded 
by when n is large. Coming back to (|6Tj) yields 



7n(^m .) + pen(m*) + a 2 - ||e||* lniun a <K^m,^) 
which concludes the proof. 



logn 



V K 2 (n) 



□ 



7.6 Proof of Proposition 5.3 

Proof of Proposition 5.3. The approach is similar to the proof of Proposition 1 in Q|. For any 
model $m \in \mathcal{M}_{\lfloor n/2 \rfloor}$, let us define 

$$\Delta(m, m_{\lfloor n/2 \rfloor}) := \gamma_n(\hat{\theta}_{m_{\lfloor n/2 \rfloor}})\,[1 + \mathrm{pen}(m_{\lfloor n/2 \rfloor})] - \gamma_n(\hat{\theta}_m)\,[1 + \mathrm{pen}(m)].$$

We shall prove that with large probability the quantity $\Delta(m, m_{\lfloor n/2 \rfloor})$ is negative for any model $m$ 
of dimension smaller than $n/4$. Hence, with large probability $d_{\hat{m}}$ will be larger than $n/4$. Let us 
fix a model $m$ of dimension smaller than $n/4$. 

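First, we use Expression (25) to lower bound $\gamma_n(\hat{\theta}_m)$. 

$$\gamma_n(\hat{\theta}_m) = \|\Pi_m^\perp(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2 + \|\Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2 + 2\langle \Pi_m^\perp(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}}), \Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}}) \rangle_n$$

$$\ge \|\Pi_m^\perp(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2 - \left\langle \Pi_m^\perp(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}}), \frac{\Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}})}{\|\Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n} \right\rangle_n^2,$$

since $2ab \ge -a^2 - b^2$ for any numbers $a$ and $b$. Hence, we may upper bound $\Delta(m, m_{\lfloor n/2 \rfloor})$ by 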



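$$\Delta(m, m_{\lfloor n/2 \rfloor}) \le \|\Pi^\perp_{m_{\lfloor n/2 \rfloor}}(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2\,[\mathrm{pen}(m_{\lfloor n/2 \rfloor}) - \mathrm{pen}(m)]$$

$$- \left\|[\Pi_m^\perp - \Pi^\perp_{m_{\lfloor n/2 \rfloor}}](\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\right\|_n^2\,[1 + \mathrm{pen}(m)]$$

$$+ \left\langle \Pi_m^\perp(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}}), \frac{\Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}})}{\|\Pi_m^\perp(\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n} \right\rangle_n^2\,[1 + \mathrm{pen}(m)]. \qquad (63)$$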



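Arguing as in the proof of Lemma 2.1, we observe that $\|\Pi^\perp_{m_{\lfloor n/2 \rfloor}}(\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2 \cdot n/[\sigma^2 + l(\theta_{m_{\lfloor n/2 \rfloor}}, \theta)]$ 
follows a $\chi^2$ distribution with $n - \lfloor n/2 \rfloor$ degrees of freedom. Analogously, the random variable 
$\|[\Pi_m^\perp - \Pi^\perp_{m_{\lfloor n/2 \rfloor}}](\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}})\|_n^2 \cdot n/[\sigma^2 + l(\theta_{m_{\lfloor n/2 \rfloor}}, \theta)]$ follows a $\chi^2$ distribution with $(d_{m_{\lfloor n/2 \rfloor}} - d_m)$ 
degrees of freedom. Let us turn to the distribution of the third term. Coming back to the definition 
of $\epsilon_m$, we observe that 

$$\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}} = Y - X\theta_m - (Y - X\theta_{m_{\lfloor n/2 \rfloor}}) = X(\theta_{m_{\lfloor n/2 \rfloor}} - \theta_m).$$

Hence, $\epsilon_m - \epsilon_{m_{\lfloor n/2 \rfloor}}$ is both independent of $X_m$ and of $\epsilon + \epsilon_{m_{\lfloor n/2 \rfloor}}$. Consequently, by conditioning 
and unconditioning, we conclude that the random variable defined in (63) follows (up to a 
$[\sigma^2 + l(\theta_{m_{\lfloor n/2 \rfloor}}, \theta)]/n$ factor) a $\chi^2$ distribution with 1 degree of freedom. 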



Once again, we apply Lemma 7.2 and the classical deviation bound $\mathbb{P}(|\mathcal{N}(0,1)| \ge \sqrt{2x}) \le 2e^{-x}$. 
Let $x$ be some positive number smaller than one that we shall fix later. There exists an event $\Omega_x$ of 
probability larger than $1 - \exp(-nx/2) - 3\exp(-(n/4 - 1)x)$ such that for any model of dimension 
smaller than $n/4$, 



A(to,to l „ /2J ) 



a 2 +l( 



'm Ln/2J 



< 



n - |n/2j 
n 

|_n/2j - d m 



1 + 2\fx + 2x) (pen(m^ n /2j ) — pen(m)) 
(1 - 2y/x - 2x)(l + pen(m)) . 



We now replace the penalty terms by their values thanks to Assumption ifTTj) . Conditionally to £l x , 
we obtain that 



A(771,TO L „/2j) 



a 2 + /( 



'"H»/2J 



) 



< 



Ln/2j 



4(l-i/)(Vas + a?) 



1 



- - 2-y/i- 2x) 



Since the dimension of the model $m$ is smaller than $n/4$, $d_m/(n - d_m)$ is smaller than 1/3. Hence, the last 
upper bound becomes 



A(m,m|„/2j) 
a 2 + 1(6. 



< 



[n/2\ 



m L „ /2 j, 



16 

— (1 - v)(y/x + x) - u(l - 2y/x~ - 2x) 



There exists some $x(\nu)$ such that conditionally on $\Omega_{x(\nu)}$, $\Delta(m, m_{\lfloor n/2 \rfloor})$ is negative for any model 
$m$ of dimension smaller than $n/4$. Since $\mathbb{P}(\Omega^c_{x(\nu)})$ goes exponentially fast with $n$ to 0, there exists 
some $n_0(\nu, \delta)$ such that for any $n$ larger than $n_0(\nu, \delta)$, $\mathbb{P}(\Omega^c_{x(\nu)})$ is smaller than $\delta$. We have proved 
that with probability larger than $1 - \delta$, the dimension of $\hat{m}$ is larger than $n/4$. 

Let us simultaneously lower bound the loss $l(\hat{\theta}_m, \theta_m)$ for every model $m \in \mathcal{M}$ of dimension larger 
than $n/4$. In the sequel, $\succeq$ means "stochastically larger than". Thanks to (27), we stochastically 
lower bound $l(\hat{\theta}_m, \theta_m)$: 

l{9 m ,9 m ) > ^max(Z; i Z m )~ 1 ||n m (e + e m )||^ 



h Vmax (nZ* m Z m ) 1 ||II m e| 



2 



where Z^Z m follows a standard Wishart distribution with parameters (n,d m ). Applying Lemma 
17.21 and Lemma [7741 in order to simultaneously lower bound the loss l(9 m ,9 m ), we find an event Q! 
of probability larger than 1 — 2 r^riT^r > sucn that 



1(0 



m, »m Iff 



> 



16n 



2 \ ^ m 2 
— C - > T. — CT 

2n 



8n 



for any model meMof dimension larger than n/4. On the event Sl x ( v ), the dimension dm is larger 
than n/4. As a consequence, ^m)ln , na ;c( „ ) > fj- in all, we obtain 



E 



KM) 



> «( 

> i( 

> U 



Hn/2j > 



H"/2J ' 



'"Hi/2J ' 



1-P(^ M )-P(tt' c ) 



+ L((5,^)cr 2 , 



<7 

32 



if $n$ is larger than some $n_0(\nu, \delta)$. 



□ 



7.7 Proofs of the minimax lower bounds 

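All these minimax lower bounds are based on Birgé's version of Fano's Lemma [6]. 

Lemma 7.17. (Birgé's Lemma) Let $(\Theta, d)$ be some pseudo-metric space and $\{\mathbb{P}_\theta, \theta \in \Theta\}$ be 
some statistical model. Let $\kappa$ denote some absolute constant smaller than one. Then for any 
estimator $\hat{\theta}$ and any finite subset $\Theta_1$ of $\Theta$, setting $\delta = \min_{\theta, \theta' \in \Theta_1, \theta \ne \theta'} d(\theta, \theta')$, provided that 
$\max_{\theta, \theta' \in \Theta_1} \mathcal{K}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \le \kappa \log|\Theta_1|$, the following lower bound holds for every $p \ge 1$, 

$$\sup_{\theta \in \Theta_1} \mathbb{E}_\theta[d^p(\hat{\theta}, \theta)] \ge 2^{-p} \delta^p (1 - \kappa).$$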

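First, we compute the Kullback-Leibler divergence between the distributions $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$. 

$$\mathcal{K}(\mathbb{P}_\theta; \mathbb{P}_{\theta'}) = \mathcal{K}(\mathbb{P}_\theta(X); \mathbb{P}_{\theta'}(X)) + \mathbb{E}_\theta\left[\mathcal{K}(\mathbb{P}_\theta(Y|X); \mathbb{P}_{\theta'}(Y|X)) \,|\, X\right]$$

The two marginal distributions $\mathbb{P}_\theta(X)$ and $\mathbb{P}_{\theta'}(X)$ are equal. The conditional distributions $\mathbb{P}_\theta(Y|X)$ 
and $\mathbb{P}_{\theta'}(Y|X)$ are Gaussian with variance $\sigma^2$ and with mean respectively equal to $X\theta$ and $X\theta'$. 
Hence, the conditional Kullback-Leibler divergence equals 

$$\mathcal{K}(\mathbb{P}_\theta(Y|X); \mathbb{P}_{\theta'}(Y|X)) = \frac{[X(\theta - \theta')]^2}{2\sigma^2}.$$

Reintegrating with respect to $X$ yields 

$$\mathcal{K}(\mathbb{P}_\theta; \mathbb{P}_{\theta'}) = \frac{l(\theta', \theta)}{2\sigma^2} \quad \text{and} \quad \mathcal{K}(\mathbb{P}_\theta^{\otimes n}; \mathbb{P}_{\theta'}^{\otimes n}) = n\,\frac{l(\theta', \theta)}{2\sigma^2}. \qquad (64)$$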
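The Gaussian step behind (64) can be checked numerically. The sketch below is only an illustration: the function names and the numerical values are assumptions introduced here. It compares the closed form $(\mu - \mu')^2/(2\sigma^2)$, which is the conditional divergence with $\mu = X\theta$ and $\mu' = X\theta'$, against a direct midpoint-rule integration of the log-likelihood ratio.

```python
import math

def kl_closed_form(mu1, mu2, sigma):
    """KL(N(mu1, sigma^2) || N(mu2, sigma^2)) = (mu1 - mu2)^2 / (2 sigma^2)."""
    return (mu1 - mu2) ** 2 / (2.0 * sigma ** 2)

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kl_numeric(mu1, mu2, sigma, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of p1 * log(p1 / p2)."""
    lo = mu1 - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        p1 = normal_pdf(y, mu1, sigma)
        # log(p1 / p2) written in closed form to avoid log(0) in the tails
        log_ratio = ((y - mu2) ** 2 - (y - mu1) ** 2) / (2.0 * sigma ** 2)
        total += p1 * log_ratio * h
    return total

# Hypothetical values standing for X*theta, X*theta' and sigma
closed = kl_closed_form(1.3, -0.4, 0.7)
approx = kl_numeric(1.3, -0.4, 0.7)
print(closed, approx)
```

Averaging the closed form over $X$ replaces $[X(\theta - \theta')]^2$ by $l(\theta', \theta)$, and $n$ independent observations multiply the divergence by $n$, which is the content of (64).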



Proof of Proposition 4.1. First, we need a lower bound of the minimax rate of estimation on a 
subspace of dimension $D$. 

Lemma 7.18. Let $D$ be some positive number smaller than $p$ and $r$ be some arbitrary positive 
number. Let $S_D$ be the set of vectors in $\mathbb{R}^p$ whose support is included in $\{1, \ldots, D\}$. Then, for any 
estimator $\hat{\theta}$ of $\theta$, 

$$\sup_{\theta \in S_D,\ l(0_p, \theta) \le Dr^2} \mathbb{E}_\theta\left[l(\hat{\theta}, \theta)\right] \ge LD\left[r^2 \wedge \frac{\sigma^2}{n}\right]. \qquad (65)$$



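Let us fix some $D \in \{1, \ldots, p\}$. Consider the set $\Theta_D := \{\theta \in S_D,\ l(0_p, \theta) \le a_D^2 R^2\}$. Since the 
$a_j$'s are non increasing, it holds that 

$$\sum_{i=1}^{p} \frac{l(\theta_{m_{i-1}}, \theta_{m_i})}{a_i^2} = \sum_{i=1}^{D} \frac{l(\theta_{m_{i-1}}, \theta_{m_i})}{a_i^2} \le \frac{l(0_p, \theta)}{a_D^2} \le R^2$$

for any $\theta \in \Theta_D$. Hence $\Theta_D$ is included in $\mathcal{E}_a(R)$. Applying Lemma 7.18, we get 

$$\inf_{\hat{\theta}} \sup_{\theta \in \mathcal{E}_a(R)} \mathbb{E}_\theta\left[l(\hat{\theta}, \theta)\right] \ge LD\left[\frac{a_D^2 R^2}{D} \wedge \frac{\sigma^2}{n}\right] \ge L\left[a_D^2 R^2 \wedge \frac{D\sigma^2}{n}\right].$$

Taking the supremum over $D$ in $\{1, \ldots, p\}$ allows us to conclude. 

□ 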
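To see how the supremum over $D$ behaves, consider the purely illustrative choice $a_i = i^{-s}$. This sequence is an assumption made for the example; the argument above only uses that the sequence is non increasing. The sketch evaluates $\sup_{1 \le D \le p} [a_D^2 R^2 \wedge D\sigma^2/n]$, that is the lower bound up to the constant $L$, and compares the maximizing $D$ with the balance point $(nR^2/\sigma^2)^{1/(1+2s)}$ at which the two terms are of the same order.

```python
def lower_bound_profile(a, R, sigma, n):
    """Return (best_D, best_value) maximizing min(a_D^2 R^2, D sigma^2 / n) over D = 1..p.

    `a` holds the non increasing sequence (a_1, ..., a_p), stored 0-indexed.
    The absolute constant L of the lower bound is omitted.
    """
    best_D, best_value = 0, float("-inf")
    for D, a_D in enumerate(a, start=1):
        value = min(a_D ** 2 * R ** 2, D * sigma ** 2 / n)
        if value > best_value:
            best_D, best_value = D, value
    return best_D, best_value

# Illustrative (assumed) setting: a_i = i^{-s}
s, R, sigma, n, p = 1.0, 2.0, 1.0, 1000, 500
a = [i ** (-s) for i in range(1, p + 1)]
best_D, best_value = lower_bound_profile(a, R, sigma, n)
balance = (n * R ** 2 / sigma ** 2) ** (1.0 / (1.0 + 2.0 * s))
print(best_D, best_value, balance)
```

The first term decreases with $D$ while the second increases, so the maximum sits where they cross, which is the usual bias-variance trade-off for ellipsoids.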



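Proof of Lemma 7.18. Let us assume first that $\Sigma = I_p$. Consider the hypercube $C_D(r) := \{0, r\}^D \times 
\{0\}^{p-D}$. Thanks to (64), we upper bound the Kullback-Leibler divergence between the distributions 
$\mathbb{P}_\theta^{\otimes n}$ and $\mathbb{P}_{\theta'}^{\otimes n}$, 

$$\mathcal{K}(\mathbb{P}_\theta^{\otimes n}; \mathbb{P}_{\theta'}^{\otimes n}) \le \frac{nDr^2}{2\sigma^2},$$

where $\theta$ and $\theta'$ belong to $C_D(r)$. Then, we apply Varshamov-Gilbert's lemma (e.g. Lemma 4.7 in 
[25]) to the set $C_D(r)$. 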

Lemma 7.19 (Varshamov-Gilbert's lemma). Let $\{0, 1\}^D$ be equipped with Hamming distance $d_H$. 
There exists some subset $\Theta$ of $\{0, 1\}^D$ with the following properties 

$$d_H(\theta, \theta') > D/4 \ \text{ for every } (\theta, \theta') \in \Theta^2 \text{ with } \theta \ne \theta' \quad \text{and} \quad \log|\Theta| \ge D/8.$$
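For small $D$, a subset with the two properties of Lemma 7.19 can be exhibited by brute force. The sketch below uses a greedy scan of the hypercube, which is not the argument used to prove the lemma and is shown only as an illustration; points of $\{0, 1\}^D$ are encoded as integers, and the function names are assumptions introduced here.

```python
import math

def hamming(x, y):
    """Hamming distance between two points of {0,1}^D encoded as integers."""
    return bin(x ^ y).count("1")

def greedy_separated_subset(D):
    """Scan {0,1}^D and keep every point at Hamming distance > D/4 from those already kept."""
    chosen = []
    for x in range(2 ** D):
        if all(hamming(x, y) > D / 4 for y in chosen):
            chosen.append(x)
    return chosen

D = 12
subset = greedy_separated_subset(D)
print(len(subset), math.log(len(subset)), D / 8)
```

The greedy set is maximal, so balls of radius $D/4$ around its points cover the hypercube; this volume argument is what guarantees a cardinality exponential in $D$.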

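Combining Lemma 7.17 with the set $\Theta$ defined in the last lemma yields 

$$\inf_{\hat{\theta}} \sup_{\theta \in C_D(r)} \mathbb{E}_\theta\left[d_H(\hat{\theta}, \theta)\right] \ge \frac{D}{16},$$

provided that $\frac{nDr^2}{2\sigma^2} \le D/16$. Coming back to the loss function $l(\cdot, \cdot)$ yields 

$$\inf_{\hat{\theta}} \sup_{\theta \in C_D(r)} \mathbb{E}_\theta\left[l(\hat{\theta}, \theta)\right] \ge LDr^2,$$

if $r^2 \le L\sigma^2/n$. Finally, we get 

$$\inf_{\hat{\theta}} \sup_{\theta \in S_D,\ l(0_p, \theta) \le Dr^2} \mathbb{E}_\theta\left[l(\hat{\theta}, \theta)\right] \ge LD\left[r^2 \wedge \frac{\sigma^2}{n}\right].$$

If we no longer assume that the covariance matrix $\Sigma$ is the identity, we orthogonalize the sequence 
$X_i$ thanks to the Gram-Schmidt process. Applying the previous argument to this new sequence of 
covariates allows us to conclude. □ 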



Proof of Corollary 4.2. This result follows from the upper bound on the risk of $\hat{\theta}$ in Theorem 3.1 and 
the minimax lower bound of Proposition 4.1. Let $\mathcal{E}_a(R)$ be an ellipsoid satisfying $\sigma^2/n \le R^2 \le \sigma^2 n^\beta$; 
then $l(0_p, \theta)$ is smaller than $\sigma^2 n^\beta$. By Theorem 3.1, the estimator $\hat{\theta}$ defined with the collection 
$\mathcal{M}_{\lfloor n/2 \rfloor \wedge p}$ and $\mathrm{pen}(m) = K\frac{d_m}{n - d_m}$ satisfies 



1(0,0) < UK) inf 

J l<i<L«/2jAp 



< UK, 8) inf 

l<i<|ra/2jAp 



K- 



n — i 

2 



[a 2 + l(6 nHl 0)} \+L(K,P) 



i 

— a 

n 



If $\theta$ belongs to $\mathcal{E}_a(R)$, then 



< 



~2 S n, a i+1 



j=i+i 



since the $(a_i)$'s are non increasing. It follows that 



En 



< L(K,8) inf 

l<i<Ln/2jAp 



R 2 a 2 +1 



(66) 



Let us define $i^* := \sup\{1 \le i \le p,\ R^2 a_i^2 \ge \sigma^2 i/n\}$ with the convention $\sup \emptyset = 0$. Since $R^2 \ge \sigma^2/n$, 
$i^*$ is larger or equal to one. By Proposition 4.1, the minimax rate of estimation is lower bounded 
as follows 


inf sup Eg 

9 ees a (R) 







1(6,6) 


> L 







a 2 , +1 R 2 V 



> L 



r, 2 - r?2 



If either p < 2n or a2 n /2\+i^ 2 — °' 2 /2, then i* is smaller or equal to [n/2\ Ap and we obtain thanks 
to 1661) that 



En 



1(0,0) < L(K, 8) 



a 2 * +1 R 2 



< L(K,0)mf sup E 

9 e^£ a (R) 



n 

1(6, 



□ 

Proof of Proposition 4.3. First, we use (64) to upper bound the Kullback-Leibler divergence between 
the distributions corresponding to parameters $\theta$ and $\theta'$ in the set $\Theta[k,p](r)$ 



nkr 
2^ 



since the covariates are i.i.d. standard Gaussian variables. Let us state a combinatorial argument 
due to Birgé and Massart [7]. 

Lemma 7.20. Let $\{0, 1\}^p$ be equipped with Hamming distance $d_H$ and, given $1 \le k \le p/4$, define 
$\{0, 1\}^p_k := \{x \in \{0, 1\}^p : d_H(0, x) = k\}$. There exists some subset $\Theta$ of $\{0, 1\}^p_k$ with the following 
properties 

$$d_H(\theta, \theta') > k/8 \ \text{ for every } (\theta, \theta') \in \Theta^2 \text{ with } \theta \ne \theta' \quad \text{and} \quad \log|\Theta| \ge \frac{k}{5}\log\left(\frac{p}{k}\right).$$

Suppose that $k$ is smaller than $p/4$. Applying Lemma 7.17 with Hamming distance $d_H$ and the 
set $\Theta$ introduced in Lemma 7.20 yields 



inf sup Kg 

6 0ee[k,p](r) 



IB 



^ ttt > provided that 
16 



nkr 2 



< 



(67) 



Since the covariates $X_i$ are independent and of variance 1, the lower bound (67) is equivalent to 

kr 2 



inf sup Kg 

9 6ee[k,p](r) 



I 9,9 



> 



l(i 



All in all, we obtain 



inf sup Kg 

e 6ee[k, P ]{r) 



> Lk \ r A 



2 l0g(f) 



Since p/k is larger than 4, we obtain the desired lower bound by changing the constant L: 

I 



inf sup Kg 

9 eee[*,p](r) 



ti 2 1 + log f 2 
> Lk \ r A ^1M CT 2 



If $p/k$ is smaller than 4, we know from the proof of Lemma 7.18 that 

„2 ' 



inf sup Kg 

e eec h {r) 



> Lk \ r A 



We conclude by observing that $\log(p/k)$ is smaller than $\log(4)$ and that $C_k(r)$ is included in $\Theta[k,p](r)$. 

□ 

Proof of Proposition 4.5. Assume first that the covariates $(X_i)$ have unit variance. If this is not the 
case, then one only has to rescale them. By Condition (22), the Kullback-Leibler divergence between 
the distributions corresponding to parameters $\theta$ and $\theta'$ in the set $\Theta[k,p](r)$ satisfies 



/C(Pf n ;Pf") < (1 + 5) 



, nkr 

We recall that $\|\cdot\|$ refers to the canonical norm in $\mathbb{R}^p$. Arguing as in the proof of Proposition 4.3, 
we lower bound the risk of any estimator $\hat{\theta}$ with the loss function $\|\cdot\|$. Applying again 
Assumption (22) then yields the desired lower bound on the risk. 

□ 

Proof of Proposition 4.6. In short, we find a subset $\Phi \subset \{1, \ldots, p\}$ whose correlation matrix follows 
a 1/2-Restricted Isometry Property of size $2k$. We then apply Proposition 4.5 with the subset $\Phi$ of 
covariates. 

We first consider the correlation matrix $\Psi_1(\omega)$. Let us pick a maximal subset $\Phi \subset \{1, \ldots, p\}$ of 
points that are $\lceil \log(4k)/\omega \rceil$ spaced with respect to the toroidal distance. Hence, the cardinality 
of $\Phi$ is $\lfloor p \lceil \log(4k)/\omega \rceil^{-1} \rfloor$. Assume that $k$ is smaller than this quantity. We call $C$ the correlation 
matrix of the points that belong to $\Phi$. Obviously, for any $(i, j) \in \Phi^2$, it holds that $|C(i,j)| \le 1/(4k)$ 
if $i \ne j$. Hence, any submatrix of $C$ with size $2k$ is diagonally dominant and, in each of its rows, the sum of the absolute 
values of the non-diagonal elements is smaller than 1/2. Hence, the eigenvalues of any submatrix of 
$C$ with size $2k$ lie between 1/2 and 3/2. The matrix $C$ therefore follows a 1/2-Restricted Isometry 
Property of size $2k$. Consequently, we may apply Proposition 4.5 with the subset of covariates $\Phi$ 
and the result follows. The second case is handled similarly. 

Definition of the correlations 

Let us now justify why these correlations are well-defined when $p$ is an odd integer. We shall 
prove that the matrices $\Psi_1(\omega)$ and $\Psi_2(t)$ are non-negative. Observe that these two matrices are 
symmetric and circulant. This means that there exists a family of numbers $(a_k)_{1 \le k \le p}$ such that 

$$\Psi[i, j] = a_{i-j \bmod p} \quad \text{for any } 1 \le i, j \le p,$$

where $\Psi$ stands for either of the two matrices. Such matrices are known to be jointly diagonalizable 
in the same basis and their eigenvalues correspond to the discrete Fourier transform of $(a_k)$. More 
precisely, their eigenvalues $(\lambda_l)_{1 \le l \le p}$ are expressed as 

$$\lambda_l = \sum_{k=0}^{p-1} \exp\left(\frac{2i\pi kl}{p}\right) a_k. \qquad (68)$$

We refer to [27], Sect. 2.6.2 for more details. In the first example, $a_k$ equals $\exp(-\omega(k \wedge (p - k)))$ 
whereas it equals $[1 + (k \wedge (p - k))]^{-t}$ in the second example. 

CASE 1: Using the expression (68), one can compute $\lambda_l$. 

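$$\lambda_l = -1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi kl}{p}\right)\exp(-k\omega) = -1 + 2\,\mathrm{Re}\left\{\sum_{k=0}^{(p-1)/2} \exp\left[k\left(\frac{i2\pi l}{p} - \omega\right)\right]\right\}$$

$$= -1 + 2\,\mathrm{Re}\left\{\frac{1 - e^{-\omega(p+1)/2}(-1)^l e^{i\pi l/p}}{1 - e^{-\omega + i2\pi l/p}}\right\}$$

$$= -1 + 2\,\frac{1 - e^{-\omega}\cos\left(\frac{2\pi l}{p}\right) + e^{-\omega(p+1)/2}(-1)^l \cos\left(\frac{\pi l}{p}\right)(e^{-\omega} - 1)}{1 + e^{-2\omega} - 2e^{-\omega}\cos\left(\frac{2\pi l}{p}\right)}.$$

Hence, we obtain that 

$$\lambda_l \ge 0 \iff 1 - e^{-2\omega} + 2e^{-\omega(p+1)/2}(-1)^l \cos\left(\frac{\pi l}{p}\right)(e^{-\omega} - 1) \ge 0.$$

It is sufficient to prove that 

$$1 - e^{-2\omega} + 2e^{-\omega(p+3)/2} - 2e^{-\omega(p+1)/2} \ge 0.$$

This last expression is non-negative if $\omega$ equals zero and is increasing with respect to $\omega$. We 
conclude that $\lambda_l$ is non-negative for any $1 \le l \le p$. The matrix $\Psi_1(\omega)$ is therefore non-negative and 
defines a correlation. 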

CASE 2: Let us prove that the corresponding eigenvalues $\lambda_l$ are non-negative. 

$$\lambda_l = -1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi kl}{p}\right)(k + 1)^{-t}.$$

Using the following identity 

$$(k + 1)^{-t} = \frac{1}{\Gamma(t)}\int_0^{+\infty} e^{-r(k+1)}\, r^{t-1}\, dr,$$

we decompose $\lambda_l$ into a sum of integrals 

$$\lambda_l = \frac{1}{\Gamma(t)}\int_0^{+\infty} r^{t-1} e^{-r}\left[-1 + 2\sum_{k=0}^{(p-1)/2} \cos\left(\frac{2\pi kl}{p}\right) e^{-rk}\right] dr.$$

The term inside the brackets corresponds to the eigenvalue for an exponential correlation with 
parameter $r$ (CASE 1). This expression is therefore non-negative for any $r > 0$. In conclusion, the 
matrix $\Psi_2(t)$ is non-negative and the correlation is defined. □ 
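The non-negativity of both families of eigenvalues can also be checked numerically for a given odd $p$. The sketch below evaluates the discrete Fourier transform of the sequence $(a_k)$ directly; the helper names and the values of $p$, $\omega$ and $t$ are illustrative assumptions. Since the sequences are symmetric ($a_k = a_{p-k}$), only the cosine part of the transform contributes.

```python
import math

def circulant_eigenvalues(a):
    """Eigenvalues of the symmetric circulant matrix with entries a[(i - j) mod p].

    For a symmetric sequence the discrete Fourier transform is real,
    so it reduces to a cosine sum.
    """
    p = len(a)
    return [sum(a[k] * math.cos(2.0 * math.pi * k * l / p) for k in range(p))
            for l in range(p)]

def exponential_sequence(p, omega):
    """First example: a_k = exp(-omega * min(k, p - k))."""
    return [math.exp(-omega * min(k, p - k)) for k in range(p)]

def power_sequence(p, t):
    """Second example: a_k = (1 + min(k, p - k))^(-t)."""
    return [(1.0 + min(k, p - k)) ** (-t) for k in range(p)]

p = 21  # odd, as required in the text
min_exponential = {omega: min(circulant_eigenvalues(exponential_sequence(p, omega)))
                   for omega in (0.05, 0.5, 2.0)}
min_power = {t: min(circulant_eigenvalues(power_sequence(p, t)))
             for t in (0.3, 1.0, 3.0)}
print(min_exponential)
print(min_power)
```

A check of this kind does not replace the proof, but it is a quick way to catch an indexing mistake when implementing these correlation matrices for simulations.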
Appendix 



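Proof of Lemma 7.1. We recall that $\gamma_n(\hat{\theta}_m) = \|Y - \Pi_m Y\|_n^2$. Thanks to the definition (23) of $\epsilon$ 
and $\epsilon_m$, we obtain the first result. Let us turn to the mean squared error $\gamma(\hat{\theta}_m)$. In the following 
computation $\hat{\theta}_m$ is considered as fixed and we only use that $\hat{\theta}_m$ belongs to $S_m$. By definition, 

$$\gamma(\hat{\theta}_m) = \mathbb{E}_{Y,X}\left[(Y - X\hat{\theta}_m)^2\right] = \sigma^2 + \mathbb{E}_X\left[\left(X(\theta - \hat{\theta}_m)\right)^2\right] = \sigma^2 + l(\theta_m, \theta) + l(\hat{\theta}_m, \theta_m),$$

since $\theta_m$ is the orthogonal projection of $\theta$ with respect to the inner product associated to the loss 
$l(\cdot, \cdot)$. We then derive that 

$$l(\hat{\theta}_m, \theta_m) = \mathbb{E}_X\left[\left(X(\hat{\theta}_m - \theta_m)\right)^2\right] = (\hat{\theta}_m - \theta_m)^* \Sigma\, (\hat{\theta}_m - \theta_m).$$

Since $\hat{\theta}_m$ is the least-squares estimator of $\theta_m$, it follows from (23) that 

$$l(\hat{\theta}_m, \theta_m) = (\epsilon + \epsilon_m)^* X_m (X_m^* X_m)^{-1} \Sigma_m (X_m^* X_m)^{-1} X_m^* (\epsilon + \epsilon_m).$$

We replace $X_m$ by $Z_m\sqrt{\Sigma_m}$ and therefore obtain 

$$l(\hat{\theta}_m, \theta_m) = (\epsilon + \epsilon_m)^* Z_m (Z_m^* Z_m)^{-2} Z_m^* (\epsilon + \epsilon_m).$$

□ 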



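Proof of Lemma 2.1. Thanks to Equation (25), we know that $\gamma_n(\hat{\theta}_m) = \|\Pi_m^\perp(\epsilon + \epsilon_m)\|_n^2$. The 
variance of $\epsilon + \epsilon_m$ is $\sigma^2 + l(\theta_m, \theta)$. Since $\epsilon + \epsilon_m$ is independent of $X_m$, $\gamma_n(\hat{\theta}_m) \cdot n/[\sigma^2 + l(\theta_m, \theta)]$ 
follows a $\chi^2$ distribution with $n - d_m$ degrees of freedom, and the result follows. 
Let us turn to the expectation of $\gamma(\hat{\theta}_m)$. By (26), $\gamma(\hat{\theta}_m)$ equals 

$$\gamma(\hat{\theta}_m) = \sigma^2 + l(\theta_m, \theta) + (\epsilon + \epsilon_m)^* Z_m (Z_m^* Z_m)^{-2} Z_m^* (\epsilon + \epsilon_m),$$

following the arguments of the proof of Lemma 7.1. Since $\epsilon + \epsilon_m$ and $X_m$ are independent, one may 
integrate with respect to $\epsilon + \epsilon_m$: 

$$\mathbb{E}\left[\gamma(\hat{\theta}_m)\right] = \left[\sigma^2 + l(\theta_m, \theta)\right]\left\{1 + \mathbb{E}\left[\mathrm{tr}\left((Z_m^* Z_m)^{-1}\right)\right]\right\},$$

where the last term is the expectation of the trace of an inverse standard Wishart matrix of 
parameters $(n, d_m)$. Thanks to [37], we know that it equals $\frac{d_m}{n - d_m - 1}$. □ 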
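The value $d_m/(n - d_m - 1)$ of the expected trace can be checked by simulation. The sketch below treats the case $d_m = 2$, where the inverse of the $2 \times 2$ matrix $Z^* Z$ has a closed form; the function name, the sample sizes and the seed are illustrative assumptions.

```python
import random

def mean_trace_inverse_wishart_2d(n, trials, seed=0):
    """Monte Carlo estimate of E[tr((Z^* Z)^{-1})] for an n x 2 standard Gaussian matrix Z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = b = c = 0.0  # entries of Z^* Z = [[a, b], [b, c]]
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            a += z1 * z1
            b += z1 * z2
            c += z2 * z2
        # trace of the inverse of a 2 x 2 matrix: (a + c) / det
        total += (a + c) / (a * c - b * b)
    return total / trials

n, d = 20, 2
estimate = mean_trace_inverse_wishart_2d(n, trials=20000)
exact = d / (n - d - 1)
print(estimate, exact)
```

The factor $1 + d_m/(n - d_m - 1)$ grows quickly when $d_m$ approaches $n$, which is why the expected prediction error of large models deteriorates.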



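Proof of Lemma 7.3. The random variable $\sqrt{\chi^2(d)}$ may be interpreted as a Lipschitz function with 
constant 1 on $\mathbb{R}^d$ equipped with the standard Gaussian measure. Hence, we may apply the Gaussian 
concentration theorem (see e.g. [1H Th. 3.4). For any $x > 0$, 

$$\mathbb{P}\left(\sqrt{\chi^2(d)} \le \mathbb{E}\left[\sqrt{\chi^2(d)}\right] - \sqrt{2x}\right) \le \exp(-x). \qquad (69)$$

In order to conclude, we need to lower bound $\mathbb{E}\left[\sqrt{\chi^2(d)}\right]$. Let us introduce the variable 
$Z := 1 - \sqrt{\chi^2(d)/d}$. By definition, $Z$ is smaller or equal to one. Hence, we upper bound $\mathbb{E}(Z)$ as 

$$\mathbb{E}(Z) \le \int_0^1 \mathbb{P}(Z \ge t)\,dt \le \int_0^{1/\sqrt{8}} \mathbb{P}(Z \ge t)\,dt + \mathbb{P}\left(Z \ge \frac{1}{\sqrt{8}}\right).$$

Let us upper bound $\mathbb{P}(Z \ge t)$ for any $0 \le t \le 1/\sqrt{8}$ by applying Lemma 7.2: 

$$\mathbb{P}(Z \ge t) \le \mathbb{P}\left(\chi^2(d) \le d[1 - t]^2\right) \le \mathbb{P}\left(\chi^2(d) \le d - 2\sqrt{d}\sqrt{dt^2/2}\right) \le \exp\left(-\frac{dt^2}{2}\right),$$

since $t \le 2 - \sqrt{2}$. Gathering this upper bound with the previous inequality yields 

$$\mathbb{E}(Z) \le \exp\left(-\frac{d}{16}\right) + \int_0^{+\infty} \exp\left(-\frac{dt^2}{2}\right) dt \le \exp\left(-\frac{d}{16}\right) + \sqrt{\frac{\pi}{2d}}.$$

Thus, we obtain $\mathbb{E}\left[\sqrt{\chi^2(d)}\right] \ge \sqrt{d} - \sqrt{d}\exp(-d/16) - \sqrt{\pi/2}$. Combining this lower bound with 
(69) allows us to conclude. □ 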
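The lower bound on $\mathbb{E}[\sqrt{\chi^2(d)}]$ can be compared with the exact value, since the mean of a chi distribution is $\sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2)$. The following sketch, whose function names are assumptions introduced here, computes the gap between the exact mean and the bound $\sqrt{d} - \sqrt{d}\exp(-d/16) - \sqrt{\pi/2}$ for a range of dimensions.

```python
import math

def mean_sqrt_chi2(d):
    """E[sqrt(chi^2(d))] = sqrt(2) * Gamma((d + 1) / 2) / Gamma(d / 2)."""
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2.0) - math.lgamma(d / 2.0))

def lower_bound(d):
    """Bound obtained at the end of the proof above."""
    return math.sqrt(d) - math.sqrt(d) * math.exp(-d / 16.0) - math.sqrt(math.pi / 2.0)

gaps = [mean_sqrt_chi2(d) - lower_bound(d) for d in range(1, 1001)]
print(min(gaps))
```

The gap stays bounded away from zero, which shows that the bound is convenient rather than tight: the exact mean differs from $\sqrt{d}$ only by a term of order $1/\sqrt{d}$.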



Acknowledgements 

I gratefully thank Pascal Massart for many fruitful discussions. I also would like to thank the referee 
for his suggestions that led to an improvement of the paper. 



References 

[1] H. Akaike. Statistical predictor identification. Ann. Inst. Statist. Math., 22:203-217, 1970. 

[2] H. Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Control, 
AC-19:716-723, 1974. System identification and time-series analysis. 

[3] S. Arlot. Model selection by resampling penalization, 2008. oai:hal.archives-ouvertes.fr:hal-00262478_v1. 

[4] Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. Ann. 
Statist., 37(2):630-672, 2009. 

[5] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. 
Annals of Statistics (to appear), 2009. 

[6] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory, 
51(4):1611-1615, 2005. 





[7] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and 
rates of convergence. Bernoulli, 4(3):329-375, 1998. 

[8] L. Birgé and P. Massart. Gaussian model selection. J. Eur. Math. Soc. (JEMS), 3(3):203-268, 
2001. 

[9] L. Birgé and P. Massart. Minimal penalties for Gaussian model selection. Probab. Theory 
Related Fields, 138(1-2):33-73, 2007. 

[10] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist., 
35(4):1674-1697, 2007. 

[11] F. Bunea, A. Tsybakov, and M. Wegkamp. Sparsity oracle inequalities for the Lasso. Electron. 
J. Stat., 1:169-194 (electronic), 2007. 

[12] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. Ann. Statist. (to 
appear), 2009. 

[13] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than 
n. Ann. Statist., 35(6):2313-2351, 2007. 

[14] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 
51(12):4203-4215, 2005. 

[15] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic networks 
and expert systems. Statistics for Engineering and Information Science. Springer-Verlag, New 
York, 1999. 

[16] N. A. C. Cressie. Statistics for spatial data. Wiley Series in Probability and Mathematical 
Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, 1993. 
Revised reprint of the 1991 edition, A Wiley-Interscience Publication. 

[17] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In 
Handbook of the geometry of Banach spaces, Vol. I, pages 317-366. North-Holland, Amsterdam, 
2001. 

[18] V. H. de la Peña and E. Giné. Decoupling. Probability and its Applications (New York). 
Springer-Verlag, New York, 1999. From dependence to independence, Randomly stopped 
processes, U-statistics and processes, Martingales and beyond. 

[19] C. Giraud. Estimation of Gaussian graphs by model selection. Electron. J. Stat., 2:542-563, 
2008. 

[20] T. Gneiting. Power-law correlations, related models for long-range dependence and their 
simulation. J. Appl. Probab., 37(4):1104-1109, 2000. 

[21] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the 
PC-algorithm. J. Mach. Learn. Res., 8:613-636, 2007. 

[22] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. 
Ann. Statist., 28(5):1302-1338, 2000. 






[23] S. L. Lauritzen. Graphical models, volume 17 of Oxford Statistical Science Series. The Clarendon 
Press Oxford University Press, New York, 1996. Oxford Science Publications. 

[24] C. L. Mallows. Some comments on $C_p$. Technometrics, 15:661-675, 1973. 

[25] P. Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in 
Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability 
Theory held in Saint-Flour, July 6-23, 2003, With a foreword by Jean Picard. 

[26] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the 
lasso. Ann. Statist., 34(3):1436-1462, 2006. 

[27] H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications, volume 104 
of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, London, 2005. 

[28] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling 
networks derived from multiparameter single-cell data. Science, 308:523-529, 2005. 

[29] J. Schäfer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene 
association network. Bioinformatics, 21:754-764, 2005. 

[30] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461-464, 1978. 

[31] R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45-54, 1981. 

[32] C. Stone. An asymptotically optimal histogram selection rule. In Proceedings of the Berkeley 
conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), 
Wadsworth Statist./Probab. Ser., pages 513-520, Belmont, CA, 1985. Wadsworth. 

[33] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 
58(1):267-288, 1996. 

[34] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 
58(1):267-288, 1996. 

[35] A. Tsybakov. Optimal rates of aggregation. In 16th Annual Conference on Learning Theory, 
volume 2777, pages 303-313. Springer-Verlag, 2003. 

[36] N. Verzelen and F. Villers. Goodness-of-fit tests for high-dimensional Gaussian linear models. 
Ann. Statist. (to appear), 2009. 

[37] D. von Rosen. Moments for the inverted Wishart distribution. Scand. J. Statist., 15(2):97-109, 
1988. 

[38] M. J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional 
and noisy setting. Technical Report 725, Department of Statistics, UC Berkeley, 2007. 

[39] A. Wille, P. Zimmermann, E. Vranová, A. Fürholz, O. Laule, S. Bleuler, L. Hennig, A. Prelić, 
P. von Rohr, L. Thiele, E. Zitzler, W. Gruissem, and P. Bühlmann. Sparse graphical Gaussian 
modelling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11), 2004. 

[40] P. Zhao and B. Yu. On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541-2563, 
2006. 






[41] H. Zou. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418-1429, 
2006. 



RR n° 6616 




Centre de recherche INRIA Saclay - Île-de-France 
Parc Orsay Université - ZAC des Vignes 
4, rue Jacques Monod - 91893 Orsay Cedex (France) 

Centre de recherche INRIA Bordeaux - Sud Ouest : Domaine Universitaire - 351, cours de la Libération - 33405 Talence Cedex 
Centre de recherche INRIA Grenoble - Rhône-Alpes : 655, avenue de l'Europe - 38334 Montbonnot Saint-Ismier 
Centre de recherche INRIA Lille - Nord Europe : Parc Scientifique de la Haute Borne - 40, avenue Halley - 59650 Villeneuve d'Ascq 
Centre de recherche INRIA Nancy - Grand Est : LORIA, Technopôle de Nancy-Brabois - Campus scientifique 
615, rue du Jardin Botanique - BP 101 - 54602 Villers-lès-Nancy Cedex 
Centre de recherche INRIA Paris - Rocquencourt : Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex 
Centre de recherche INRIA Rennes - Bretagne Atlantique : IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex 
Centre de recherche INRIA Sophia Antipolis - Méditerranée : 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex 



Éditeur 

INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France) 

http://www.inria.fr 
ISSN 0249-6399 







